Big Data Analytics and Processing Platform in Czech Republic Healthcare

Abstract: Big data analytics (BDA) in healthcare has made a positive difference in the integration of Artificial Intelligence (AI) in advancements of analytical capabilities, while lowering the costs of medical care. The aim of this study is to improve the existing healthcare eSystem by implementing a Big Data Analytics (BDA) platform and to meet the requirements of the Czech Republic National Health Service (Tender-Id. VZ0036628, No. Z2017-035520). In addition to providing analytical capabilities on Linux platforms supporting current and near-future AI with machine-learning and data-mining algorithms, there is the need for ethical considerations mandating new ways to preserve privacy, all of which are preconditioned by the growing body of regulations and expectations. The presented BDA platform has met all requirements (N > 100), including the healthcare industry-standard Transaction Processing Performance Council (TPC-H) decision support benchmark in compliance with the European Union (EU) and the Czech Republic legislations. Currently, the presented Proof of Concept (PoC) that has been upgraded to a production environment has unified isolated parts of Czech Republic healthcare over the past seven months. The reported PoC BDA platform, artefacts, and concepts are transferrable to healthcare systems in other countries interested in developing or upgrading their own national healthcare infrastructure in a cost-effective, secure, scalable and high-performance manner.


Introduction
Big data has influenced the ways we collect, manage, analyse, visualise, and utilise data. For healthcare, on adopting an eSystem with implemented big data analytics (BDA), there is an expectation that modern, robust, high-performance and cost-effective BDA technologies can preserve patient privacy, while enhancing data-driven support for medical staff, as well as the broader patient population. Currently, the Czech Republic is in the process of adopting and incrementally upgrading their healthcare eSystem, leveraging BDA to enhance the quality of care with integrated national and regional support.
The scope of this paper is to report on the prerequisite factors, and tests influencing the implementation of the BDA platform with the performance required to support the national strategy for BDA adoption in the Czech Republic healthcare system. The reported healthcare solution had to pass more than 100 complex requirements (N = 119, including 13 bonus features), pre-requisites, and system conditions that were tested on the proposed platform, in compliance with European Union and Czech Republic regulations as well as the Transaction Processing Performance Council (TPC-H) benchmarks [1]. In the authors' view, which is aligned with global trends and EU initiatives [2][3][4][5][6]: (1) the growing amount of data in healthcare, data-streaming IoT devices, and mobile apps have made the adoption of BDA technologies inevitable for modern society; (2) combining BDA, data mining, and AI with healthcare applications is a crucial step in advancing towards the next generation of healthcare eSystems [7,8]; and (3) the BDA platform implementation in one of the EU member states will shape decisions regarding replicability and knowledge transfer for other EU members who are in the process of transforming their healthcare systems [2]. Big data technologies have been adopted in many industries such as transport, banking, automotive, insurance, media, education, and healthcare [9][10][11][12][13]. Common to the exponential trend of Internet network traffic, the volume of data produced every day is also increasing exponentially in modern healthcare [14]. When the volume of data grows beyond a certain limit, traditional systems and methodologies can no longer cope with data processing demands or transform data into a format for the task required. Traditionally, small data portions as parts of online transaction processing (OLTP) systems are collected in a controlled manner, known as short atomic transactions [15]. 
In contrast, for big data clustered environments, there are stream and batch data processing demands, all requiring more flexibility for various data distribution patterns and matching eSystems scalability [16].
Typically, for big-data eSystems, stream-processing is concerned with (near) real-time analytics and data prediction, while batch data processing deals with implementing complex business logic with advanced and specialised algorithms.
Small data systems typically scale vertically by adding more resources to the same machine; this can be costly and eventually reach maximum possible upgrades. Contrastingly, big data systems are cluster-based and therefore depend mostly on horizontally scalable architecture, which in the long run provides increased performance efficiency at a lower cost by employing commodity hardware.

Big Data Technology Perspective
The idea of applying big data clusters to process and analyse healthcare data is not new [17][18][19][20]. For example, in 2009, early experiments conducted on a 100-node cluster with a set of benchmarks revealed various trade-offs in performance for selected parallel systems to store and process data intended for healthcare use [21].
Recently, there has been a growing interest in and need for eSystem platforms and cloud-based technologies, emphasising new and innovative big data tools employing various data mining, machine learning [22], and other AI-based techniques that could enable knowledge discovery, personalised patient-centred modelling, identification of groups sharing similar characteristics, predictive analytics, improved drug safety, and enhanced diagnostic capabilities.

Challenges and Opportunities
The integration and governance of big data technologies in healthcare has local and global implications in terms of challenges and opportunities [6,8,12,23]. Challenges in healthcare include "issues of data structure, security, data standardisation, storage and transfers, and managerial skills such as data governance" [24].
For advancements in healthcare services, implementing BDA platforms combined with data analytics [25] has the potential to:
• improve the quality of personalised care and medical services;
• reduce the cost of treatment;
• use predictive analytics for, e.g., patients' daily (loss of) income and disease progression;
• use real-time visualisation and analytics for immediate care and cases of readmission;
• enhance patient engagement with their own healthcare provider via processing satisfaction evaluation data and self-reported health status [26,27];
• integrate small-data analytics and knowledge discovery that may also be integrated with big data [28];
• integrate video, motion sensors, 2D/3D kinematic and other privacy-preserving motion data for human motion modelling and analysis (HMMA), linking active life, well-being, and health benefits [29][30][31][32][33][34];
• provide near-real-time outbreak geo-mapping information, facilitate collaboration, community engagement, data transparency, and data exchange [35,36]; and
• use healthcare data for identification of trends, strategic planning, governance, improved decision-making, and cost reduction [24].
To enable advancements towards the next generation of BDA platforms that can help improve healthcare outcomes, this study addresses the following questions:
i. Is it possible to design and build a BDA platform for the Czech Republic healthcare service, in line with EU legislation, TPC-H [1] benchmarks, and other statutory requirements?
ii. If so, what BDA platform would provide optimal cost and performance features, while allowing installation of open-source software with various machine learning algorithms, development environments, and commercial visualisation and analytical tools?
iii. To what extent would such a BDA-based eSystem be future-proof for maintaining reliability, robustness, cost-effectiveness, and performance?

Industry Benchmarks
Industry benchmarks have an important role in advancing design and engineering solutions in database systems. For example, the Transaction Processing Performance Council (TPC) [1] has an important role in encouraging the adoption of industry benchmarks in computing, which are today widely used by many leading vendors to demonstrate their products' performance. Similarly, large buyers often use TPC benchmark results [37,38] as a measurable point of comparison between new computing systems and technologies to ensure a high level of performance in their computing environments [39].

Big Data Analytics
Analytical technologies for big data [40] are showing promising results in their attempts to manage ever-expanding data in healthcare. For example, a 2014 Massachusetts Institute of Technology (MIT) study on big data in intensive care units [20] reported findings that data analysis could positively predict critical information, such as duration of hospitalisation, number of patients requiring surgical intervention, and which patients could be at risk of sepsis or iatrogenic diseases. For such patients, data analytics could save lives or prevent other complications that patients might encounter.
Technologies utilising BDA are also being successfully employed outside of hospitals [41]. The medical community and government bodies now recognise the importance of monitoring the incidence of influenza illness using massive data analysis technologies [42]. Seasonal influenza epidemics are a significant problem for public health systems, annually leading to 250,000-500,000 deaths worldwide [43][44][45]. Furthermore, new types of viruses against which the population lacks immunity can lead to a pandemic with millions of deaths [43]. Early detection of disease activity leads to a faster response and may reduce the impact of both seasonal and pandemic influenza in terms of saving lives or reducing respiratory illnesses on a world-wide scale [43]. One method of early detection is to monitor Internet search behaviour in relation to health queries, such as that employed by Google [22,45]. In addition, it was discovered that some queries are strongly correlated with the percentage of doctor visits in which the patient presents symptoms of influenza. This correlation made it possible for Google to produce an algorithm that estimates influenza activity in different regions of the United States with a one-day delay. Among other algorithms, this approach allows Google to use queries to detect epidemics from influenza-like searches in areas where the population has regular access to the Internet.
In light of the recent Coronavirus outbreak and lessons learned from SARS and Ebola [35,36,46,47], BDA eSystems could provide ad-hoc analytics, data exchange, and near-real-time geo-mapping functionality for pandemic/epidemic tracking and for visualising data on outbreak spread and risks. For pandemic outbreaks, ad-hoc analytics can be considered a human-centric and active approach to pattern discovery. For example, the chance discovery approach [28] can be combined with available data analytics, machine learning, and data mining approaches. Such a human-centric and active approach of reducing sample size and emphasising a relative minority portion can be applied to improve the selective screening efficiency of incoming travellers from infected regions [28,46].

SQL vs. NoSQL Approaches
Structured query language (SQL) has been developed for relational databases, while in more recent times, not-only SQL (NoSQL) has been developed for non-relational and distributed databases. Data can be stored and processed in either a row-oriented or column-oriented format. The row-oriented principle based on Codd's relational model is well-established in most database applications [48][49][50][51]. However, such well-established relational database management systems (RDBMS) [50,52] are not efficient for analytical applications that mostly perform create, read, update, and delete operations. Over the last few years, NoSQL [38] databases have been tested and studied, and their performance evaluated in different studies [53,54], where some have focused their assessments on the advantages of the use of NoSQL technologies [55]. For the architects of BDA platforms, known differences between Structured Query Language (SQL) and NoSQL database management systems make designing a challenging task, with a number of decisions to address the purpose and related set of requirements. Newer than SQL, NoSQL databases support the notion of elastic scaling, allowing for new nodes to be added to improve availability, scalability, and fault tolerance [56].
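The difference between row-oriented and column-oriented layouts can be illustrated with a minimal sketch. The records, field names, and the helper `to_columns` below are hypothetical and serve only to show why an aggregate over one attribute touches less data in a columnar layout; they do not reflect the schema or storage engine of the reported platform.

```python
from typing import Any

# Hypothetical records; the field names are illustrative only.
rows = [
    {"patient_id": 1, "region": "Prague", "cost": 120.0},
    {"patient_id": 2, "region": "Brno", "cost": 80.0},
    {"patient_id": 3, "region": "Prague", "cost": 200.0},
]


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into a column-oriented layout."""
    columns: dict[str, list[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Transactional access: fetch one whole record (natural in a row layout).
record = rows[1]

# Analytical access: aggregate one attribute over all records.
# The column layout reads only the "cost" values and skips the rest.
total_cost = sum(columns["cost"])
```

In a row layout the same aggregate would have to walk every record and pick out one field, which is the access pattern that column stores are designed to avoid.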
Many of the related works and reviews on big data techniques [57] and technologies used in healthcare rely mostly on silo principles for data integration, data processing, and data visualisation. An application that operates on the columns in the dataset allows overcoming of performance problems with the "NoSQL" or "Not Only SQL" databases [43,45]. These databases can be recognised on premises or in the cloud. Cloud computing [49,58] also offers this database service. NoSQL databases provide elastic and horizontal scaling features, allowing new nodes to be added. New nodes are typically designed on the basis of low-cost (commodity) hardware.
In relation to the main objective of this work, the creation of a real-life platform for big data integration, master data management, ad-hoc analysis, data storing, data processing, and visualisation is based on the NoSQL database for data storing and data processing on Vertica clusters.

Materials and Methods
The dataset used for this cross-sectional study includes an anonymised real-life data sample provided by the Institute of Health Information and Statistics (IHIS) of the Czech Republic. To comply with the two-phase IHIS acceptance testing, the requirement analysis was combined with the design science approach [53] involving the production of a scalable platform, architecture, software, and hardware infrastructure. In the context of the tender bid and the rigour of system procurement, the set of IHIS requirements consists of tender evaluation criteria based on a weighted scoring system, including the total cost of ownership, mandatory requirements (e.g., TPC-H), and 13 bonus features.
The produced solution, based on the cyclic experimental design approach, has met all the requirements, while also achieving the highest performance ranking.
Incremental performance and functionality improvements from the Phase I and Phase I+II evaluations involved:
i. Phase I-compliant big data eSystem requirements and design decisions influencing architecture design;
ii. Decomposition of acceptance testing requirements;
iii. Hardware infrastructure optimisation to a set of the requirements (N = 119) for weighted scoring, including TPC-H decision support and minimal system performance; and
iv. Performance-driven eSystem optimisation from available test datasets.

IHIS Requirements
The IHIS requirements can be grouped by the following aspects:
• Scalability: the eSystem must allow performance enhancement via additional and accessible computing technology, including commodity hardware products.
• Modularity, Openness, and Interoperability: the system components must be integrated via specified interfaces according to exact requirement specifications. It is also essential that a wide variety of vendors can readily utilise system components.
• Extensibility: all tools and components of the eSystem must provide space for future upgrades, including functionality and capability advancements.
• Quality Assurance: a tool for validating data and metadata integrity is required to ensure that processed data remains accurate throughout the analysis procedure.
• Security: the eSystem must be operable on local servers, without reliance on cloud or outsourced backup systems. It is essential that the eSystem provides security for all data against external or internal threats. Therefore, authorisation, storage access and communication are of utmost concern. User access rights must be set at the database, table, or column level to restrict data access to a limited number of advanced users. The eSystem must log all executions and read operations for future audits. The eSystem must support tools for version control and development, while meeting the requirements for metadata and data versioning, backup, and archiving.
• Simplicity: the eSystem must allow for parallel team collaboration on all processes, data flow, and database schemas. All tasks must be fully editable, allowing commit and revert changes in data and metadata. It is essential that the eSystem be simple and easy to use, as well as stable and resilient to subsystem outages.
• Performance: the eSystem must be designed for the specified minimum number of concurrent users. Batch processing of data sources and sophisticated data mining analyses are considered essential. Complete data integration processing of quarterly data increments must not exceed one hour.
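The Security requirement combines two ideas: access rights granted at the database, table, or column level, and an audit trail of every read. The following minimal sketch shows one way these could fit together. The class `AccessPolicy`, its methods, and the user, table, and column names are all hypothetical; this is not the mechanism used by the reported platform or by any specific database product.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # Grants per user as (database, table, column) triples.
    # None acts as a wildcard meaning "every table" or "every column".
    grants: dict = field(default_factory=dict)
    # Every read attempt is recorded here for later audits.
    audit_log: list = field(default_factory=list)

    def grant(self, user, database, table=None, column=None):
        """Grant read access at database, table, or column level."""
        self.grants.setdefault(user, set()).add((database, table, column))

    def can_read(self, user, database, table, column):
        """Check a read request and log it, whether allowed or not."""
        allowed = any(
            db == database and tbl in (None, table) and col in (None, column)
            for db, tbl, col in self.grants.get(user, ())
        )
        self.audit_log.append((user, database, table, column, allowed))
        return allowed


policy = AccessPolicy()
policy.grant("analyst", "ihis", "claims", "atc_code")  # column-level grant
policy.grant("admin", "ihis")                          # database-level grant
```

A call such as `policy.can_read("analyst", "ihis", "claims", "patient_id")` is refused under these grants, and the refusal itself is appended to `audit_log`.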
The most important IHIS requirements mandate that:
i. All tools, licenses, and environmental features used in the Proof of Concept (PoC) tests must match the eSystem offer submitted and documented in the public contract. To meet contractual obligations, the proposed solution cannot have artificially increased system performance, altered license terms, or otherwise improved or modified results vis-à-vis the delivery of the final solution. The environment configuration must satisfy the general requirements of the proposed eSystem (usage types, input data size, processing speed requirements).
ii. The proposed eSystem cannot be explicitly (manually) optimised for specific queries and individual task steps within a test. The test queries are not to be based on general metadata (cache, partitioning, supplemental indexes, derived tables, and views), except in exceptional cases where optimising the loading of large amounts of data is needed. The techniques based on general metadata may be used in future for enhancing performance but are not required as a precondition for system availability. To load large amounts of data, the environment configuration can be manually adjusted to a non-standard configuration for further test steps (Figure 1).
iii. The configuration must not be manually changed during the test to optimise individual tasks; the eSystem is required to be universal for tasks that may overlap in time.
Appl. Sci. 2020, 10, 1705

TPC-H Performance Requirements and Tests
Meeting TPC-H benchmarks involves testing for minimum requirements, including a set of values and parameters. Standardised test conditions specified in the TPC-H Benchmark™ are available online (http://www.tpc.org/tpch). The IHIS requires that any proposed eSystem meets performance metrics aligned with standard TPC-H workloads during developmental phase testing (Figure 1). For data storage benchmarks, the data must be stored on independent disks, with a replication factor greater than two. The solution must also support best practices regarding data security and data protection, including hot backup, cold backup, and recovery. Table 1 shows the parameters predefined by IHIS for their initial test databases (1 TB and 3 TB), and values for the power tests in the first run (after system cold restart) and second run (after database restart). Before starting and testing the TPC-H benchmarks, optimising the system for specific queries (such as manual or another non-standard optimisation) is not permitted. The footnotes of Table 1 refer to Table 2 and to the TPC-H benchmark (Table 3).
The TPC-H tests are to emulate future production eSystem behaviour. For all contenders (Tender Id. VZ0036628, No. Z2017-035520), the supplied test data consisted of simulated medical documentation records from three fictional insurance companies: three standard quarterly packages (one quarter for each company), plus one correction (simulating a situation where one insurer supplied inadequate data). Data batches (compressed using the ZIP format) were exchanged in real time, containing images and structured alphanumerical data in the comma-separated values (CSV) format. Standard input data were up to 30 GB per packet, amounting to a total of 3 TB of data. Test data contained roughly the same number of rows as the expected data, but with a reduced number of columns and with added redundant attributes to reflect the problem dimensionality and approximate amount of data anticipated. Data related to patient drug use was confirmed with The Anatomical Therapeutic Chemical Classification System [61].
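As an illustration of the batch format described above (ZIP archives carrying images together with CSV files), the following sketch reads only the CSV members of one archive. The function names, file names, and column names are hypothetical, and the tiny demo batch is built in memory so that the example runs on its own; real batches of up to 30 GB would need streaming rather than loading every row into a list.

```python
import csv
import io
import zipfile


def read_csv_members(archive_bytes: bytes) -> dict[str, list[dict[str, str]]]:
    """Return the parsed rows of every .csv member in a ZIP archive."""
    tables: dict[str, list[dict[str, str]]] = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip non-CSV members such as images
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables


def make_demo_batch() -> bytes:
    """Build a small in-memory ZIP batch (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "insurer_a_q1.csv",
            "patient_id,atc_code\n1,A10BA02\n2,C07AB02\n",
        )
        archive.writestr("scan_0001.png", b"\x89PNG")
    return buffer.getvalue()


tables = read_csv_members(make_demo_batch())
```

Each CSV member becomes a list of dictionaries keyed by the header row, which is a convenient shape for a subsequent validation or loading step.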
For the purpose of conducting TPC-H Benchmark testing, a contender's MHDA system had to be installed on premises utilising a private network. The on-premises multi-user MHDA system had to operate in a parallelised application environment. Once the metadata were loaded, the system was required to run without intervention, so that no configuration could be manually altered in a way that would compromise the integrity of the TPC-H tests.
For prototype development, testing, and reported results, we installed CentOS Linux (7.3). Our solution (as PoC) met all of the requirements for Massively Parallel Processing (MPP) as an MHDA system. The proposed BDA platform and its architecture also allowed for remote support according to the specified Service Level Agreement (SLA), covering both fault correction and the Next Business Day (NBD) requirements. After the PoC handover, the chosen, installed, and re-tested operating system on IHIS premises was Red Hat Enterprise Linux Server release 6.8 (Santiago).
For step one of the testing activities (Figure 1), we configured the system architecture with five nodes operating in the Vertica 9.0.1-1 database cluster. For step two, a DBGEN program generated 1 TB or 3 TB databases. At this point, the initial data was loaded into the system and we ran 1 TB and 3 TB power and throughput tests. These tests resulted in records of individual measurements. After testing one cluster, we deleted the data and generated another 3 TB database. We repeated these power tests for each of the five nodes, including the recording of measurement results.
To compute query processing power for a database of a given size (TPC-H Power@Size), we used Equation (1), in compliance with the most recent TPC Benchmark™ H standard specification (revision 2.18.0, p. 99) [1]:

$$\text{TPC-H Power@Size} = \frac{3600 \times SF}{\sqrt[24]{\prod_{i=1}^{22} QI(i,0) \times \prod_{j=1}^{2} RI(j,0)}} \qquad (1)$$

where QI(i,0) is the timing interval, in seconds, of a query Qi within the single query stream of the TPC-H power test; RI(j,0) is the timing interval, in seconds, of a refresh function RFj within the single query stream of the power test; and SF represents the corresponding scale factor of the database size [1].

Figure: Business process organisation and data flow.
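Equation (1) can be evaluated with a few lines of code. The sketch below assumes the standard TPC-H form of the power metric, namely 3600 × SF divided by the geometric mean of the 22 query intervals and the 2 refresh intervals; the function name `tpch_power` and the sample timings are hypothetical and are not measurements from this study.

```python
import math


def tpch_power(query_seconds, refresh_seconds, scale_factor):
    """TPC-H Power@Size from power-test timing intervals in seconds.

    query_seconds:   22 values QI(i,0), one per query Q1..Q22
    refresh_seconds: 2 values RI(j,0), one per refresh function RF1, RF2
    scale_factor:    SF, e.g. 1000 for a 1 TB database
    """
    if len(query_seconds) != 22 or len(refresh_seconds) != 2:
        raise ValueError("expected 22 query and 2 refresh intervals")
    intervals = list(query_seconds) + list(refresh_seconds)
    if any(t <= 0 for t in intervals):
        raise ValueError("timing intervals must be positive")
    # Geometric mean computed through logarithms to avoid overflow or
    # underflow when multiplying 24 values directly.
    log_mean = sum(math.log(t) for t in intervals) / len(intervals)
    return 3600.0 * scale_factor / math.exp(log_mean)


# Hypothetical run: every interval takes 10 s on a 1 TB (SF = 1000) database.
power_1tb = tpch_power([10.0] * 22, [10.0] * 2, 1000)
```

Because the denominator is a geometric mean, one very slow query lowers the score less than it would under an arithmetic mean, while a uniform speed-up of all queries raises the score proportionally.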

Results
The presented solution, as a proof of concept (PoC), was implemented and transferred to the IHIS committee, which is integrated with the Ministry of Social and Labour Security, Ministry of Defence, Ministry of Internal Affairs, Ministry of Health Insurance and Eurostat (Statistical Office of the European Union). IHIS requirements also complied with the EU-based General Data Protection Regulation (GDPR) [58,60,62]. Since the processing of big data requires high-performance computing, we used a cluster computing architecture to take advantage of massively parallel processing and a NoSQL database [56,63,64].
The Vertica Analytic Database implements the principles of the C-Store project [52] and is widely used as a commercial relational database system for business-critical systems. The Vertica database has characteristics that are important for exceeding expected system performance, while meeting all IHIS requirements, such as: (1) massively parallel processing (MPP) system, (2) columnar storage, (3) advanced compression, (4) expanded cloud integration, (5) specialised tool for database design and administration, and (6) built-in functionalities for an analytic workload (e.g., a few to ten queries per second) rather than for a transactional workload (e.g., a few hundred to thousands of transactions per second).
In choosing Vertica for our client's requirements, we also considered the following benefits:
• Provision for an SQL layer as well as support for connection to Hadoop and fast data access to ORC, Parquet, Avro, and JSON as column-oriented data;
• High data compression ratio, including high-degree concurrency and massively parallel processing (MPP) system for processing tasks;
• Analytical database support for Kafka and Spark;
• Pricing model of enterprise solution optimised for IHIS requirements;
• Potential for huge demands of future analytical workloads;
• Cloud integration for future development; and
• Compression capabilities that can handle and deliver high-speed results for petabyte-scale datasets.

Overview of Key Components
The presented BDA platform, as a distributed and large-scale system, is designed on commodity hardware with gigabit Ethernet interconnections (Figure 3). By adding nodes, the Vertica database allows system performance improvement as per IHIS requirements and general expectations for exponential growth in healthcare data [63].
The BDA platform unifies three key components: Talend (version 6.4), Vertica (version 9.0.1), and Tableau Desktop and Server (version 10.5). As a specialised data integration environment for BDA platforms, Talend provides functionality for Ad-hoc Analysis Preparation, Metadata Management, Data Quality Management, and Data Integration. The Vertica NoSQL database, built on five nodes, provides Data Storage and Data Processing. Data Visualisation is covered by the Tableau Desktop (professional edition).


Data Integration (DI) Layer
The data integration (DI) layer represents a system module that enables parameterised data manipulation functions, including data transformation, processing control and hierarchy, reading, writing, and parallel or sequential tasks/threads processing. We use the term "metadata" to describe the resulting statistics, classification, or data aggregation tasks. The DI provides metadata for development, test, and production environments. The DI layer also provides visualisation of its processes in the form of data-flow diagrams. Another DI-specific tool generates outputs from preprocessed data. This tool also supports rapid process development, including selection and transformation of large volumes of primary data in parallel multi-threaded execution. To deal with near-future technical and operational challenges, the DI module also contains a debugging tool for software development, testing, and maintenance.

Data Storage (DS)
Data storage (DS) represents a system module that contains cluster-based, horizontally scalable physical architectures built onto NoSQL Vertica databases. The DS runs on commodity hardware with distributed storage capabilities, which allows for Massively Parallel Processing (MPP) over the entire data collection. The DS keeps data in a column format in two containers, Write Optimised Store (WOS) and Read Optimised Store (ROS), for best performance. Each cluster is a collection of hosts (nodes) with Vertica software packages. Each node is configured to run a Vertica NoSQL database as a member of a specific database cluster, supporting redundancy, high availability, and horizontal scalability, ensuring efficient and continuous performance. This infrastructure allows for recovery from any potential node failure by allowing other nodes to take control. For the presented solution (Figure 3), we set the fault tolerance to K-safety = 2 [62]. The DI components specify how many copies of stored data Vertica should create at any given time.
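The relationship between the K-safety setting, the number of data copies, and the cluster size can be sketched as follows. The rules encoded here (K + 1 copies of the data, at least 2K + 1 nodes, and more than half of the nodes remaining up) reflect our reading of the Vertica documentation and should be checked against the version in use. The class `KSafety` and its method names are hypothetical, and `survives` is a deliberately conservative simplification that ignores which particular nodes fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KSafety:
    k: int  # number of node failures the database is designed to tolerate

    @property
    def copies(self) -> int:
        """Copies of each data segment kept across the cluster (assumed K + 1)."""
        return self.k + 1

    @property
    def min_nodes(self) -> int:
        """Smallest cluster that can support this K-safety (assumed 2K + 1)."""
        return 2 * self.k + 1

    def survives(self, total_nodes: int, failed_nodes: int) -> bool:
        """Conservative check: at most K failures and a majority still up."""
        remaining = total_nodes - failed_nodes
        return failed_nodes <= self.k and remaining > total_nodes / 2


# The setting reported for the presented five-node cluster.
k2 = KSafety(k=2)
```

Under these assumptions, K-safety = 2 corresponds to three copies of the data and matches the five-node cluster described in this paper as the minimum supported size.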

Data Quality Management (DQM)
The data quality management (DQM) module supports data quality control including trends and data structures. The DQM generates complex models for end-users supporting data analysis for error detection and correction as well as sophisticated visualisation and reporting required for quality control tasks. It creates, sorts, groups, and searches for validation rules entered in a structured form. Validation rules can be executed over a user-defined dataset and managed centrally.
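To illustrate how centrally managed validation rules might be executed over a user-defined dataset, the following sketch applies a small rule list to a few records. The `Rule` structure, the rule definitions, and the column names are hypothetical and are not taken from the DQM module itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str                     # human-readable description of the rule
    column: str                   # column the rule applies to
    check: Callable[[str], bool]  # returns True when the value is valid


# A centrally managed rule set (hypothetical examples).
RULES = [
    Rule("patient_id is numeric", "patient_id", str.isdigit),
    Rule("quarter is Q1-Q4", "quarter", lambda v: v in {"Q1", "Q2", "Q3", "Q4"}),
]


def validate(rows, rules):
    """Return (row_index, rule_name) for every violated rule."""
    violations = []
    for index, row in enumerate(rows):
        for rule in rules:
            # A missing column is treated as an empty, and thus invalid, value.
            if not rule.check(row.get(rule.column, "")):
                violations.append((index, rule.name))
    return violations


dataset = [
    {"patient_id": "17", "quarter": "Q2"},
    {"patient_id": "x9", "quarter": "Q5"},
]
violations = validate(dataset, RULES)
```

Keeping the rules as data rather than as scattered conditionals is what makes it possible to sort, group, and search them, as the DQM description requires.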

Metadata Management (MDM)
The metadata management (MDM) module supports the management of user, technical, and operational metadata. The MDM centrally processes metadata from every component of the MHDA system, housed collectively in the data warehouse.
The MDM can compare different versions of metadata and display outputs, including visualisation intended for data reporting. The MDM is able to create dynamic, active charts, and tables, allowing multidimensional and interactive views. The MDM uses sandboxing for testing temporary inputs and outputs and can generate outputs in HTML, PDF, and PPT formats. The MDM component utilises Online Analytical Processing (OLAP) operations over a multidimensional data model. Additionally, it contains a glossary of terms and concept links to enable impact and lineage analysis.

Ad-hoc Analysis Preparation (AAP)
For ad-hoc analysis preparation (AAP) processes, we programmed two different versions in the Talend Open Studio integration tool. In the first version, the MHDA uses the Extract, Transform, and Load (ETL) components of the integration tool. These components read data from data warehouse structures (dimensions and fact tables) into memory. Then, the filtering and aggregation components process the data into an output table. The second version uses the Extract, Load, and Transform (ELT) components of the integration tool. Both ETL and ELT components are able to generate user-friendly, unmodified SQL Data Manipulation Language (DML) statement(s) in the background. The AAP module accelerates the processing time without having to load large amounts of metadata into the program memory. Figure 4 shows forecasting on a historical test dataset supplied by IHIS, where we tested the ARIMA [65] in-database approach to time series. This model can be created either directly in the NoSQL Vertica database, which supports predictive modelling, or in a separate statistical tool such as Tableau, which will take data from the database and return the created model (written in Predictive Model Markup Language (PMML) or another format the database supports). Inclusion of machine learning and data analytics algorithms into the database often leads to increased processing demands on BDA platforms. Regarding data visualisation, Tableau Server (version 10.5, unlimited licenses) provides visualisation via both a graphical user interface (GUI) and a web browser for standard end users. Tableau Desktop, however, provides additional functionality intended for data analyst and data scientist user profiles.

Data Visualisation (DV)
The data visualisation (DV) module contains tools for describing data perspectives and knowledge discovery from data. The DV components represent data and metadata visually and give interpretations for possible insights. Additionally, we embedded DV components in Tableau to provide data and metadata visualisations in graphs and pictures. Tableau is a popular interactive analytical and data visualisation tool, which can help simplify raw data into easily comprehensible dashboards and worksheets. For example, Figure 5 depicts a part of the data visualisations from one of the IHIS case studies with a geographical map overlay.

Figure 5.
Example of under-10 s real-life diagnosis as regional data visualisation in the Czech Republic using Tableau Desktop (version 10.5). In addition to data integration, geo-mapping functionality provides near real-time pandemic/epidemic mapping, tracking, outbreak spread, and risk data visualisation.

TPC-H Tests Configuration
TPC-H requires data to be generated for eight tables using the specified scale factor (SF), which determines the approximate amount of data in gigabytes (Figure 6). We used the TPC-H power test, which measures the throughput/response times of a sequence of 22 queries (defined on p. 29) [1]. Vertica supports the ANSI SQL-99 standard and all queries are applied with no syntax changes. The test datasets were created by the TPC-H DBGEN program (Figure 1). In our tests, we found that the queries Q9 and Q21 are more complex in comparison with the commonly expected queries. For power benchmark purposes, we have shared TPCH_SF1000, with row counts scaled by a factor of 1000 (several billion elements).
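To make the effect of the scale factor concrete, the sketch below estimates the row counts of the eight TPC-H tables for a given SF. The base cardinalities are those given in the TPC-H specification, not figures from this study; NATION and REGION have a fixed size, and the LINEITEM count is approximate.

```python
# Base row counts at SF = 1, as given in the TPC-H specification.
# LINEITEM is approximate (about six million rows per unit of SF).
BASE_ROWS = {
    "supplier": 10_000,
    "part": 200_000,
    "partsupp": 800_000,
    "customer": 150_000,
    "orders": 1_500_000,
    "lineitem": 6_000_000,
}

# These two tables do not grow with the scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def estimated_rows(scale_factor: int) -> dict:
    """Approximate row count of each TPC-H table for the given scale factor."""
    if scale_factor < 1:
        raise ValueError("scale_factor must be at least 1")
    rows = {table: count * scale_factor for table, count in BASE_ROWS.items()}
    rows.update(FIXED_ROWS)
    return rows


if __name__ == "__main__":
    for table, count in sorted(estimated_rows(1000).items()):
        print(f"{table:<10} {count:>15,}")
```

At SF = 1000, LINEITEM alone holds roughly six billion rows, which is consistent with the "several billion elements" noted above.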
The performance achieved using the dataset predefined by TPC-H (Figure 1) shows that the developed eSystem (as PoC) outperformed other competitors with similar product characteristics [66][67][68]. The presented BDA platform architecture (Figure 3) and the reported performance (Tables 2 and 3, Figures 7 and 8) were also tested by the government. The developed eSystem was installed within the Czech Republic borders in on-premises centralised mode using data communication channels that are physically separated from the existing Internet infrastructure.
The performance of TPC-H tests running on a Vertica cluster for 1 TB and 3 TB database sizes (Table 3) is visualised in Figures 7 and 8, indicating similar duration patterns for complex and commonly expected queries.
As per the client's requirement, it was also necessary to include in our report two test runs on the same hardware configurations. The first set of TPC-H query executions was completed after a cold system restart. The second set of test runs provides an indication of the performance improvement after a database restart only.
Monitoring I/O requests to accurately capture workload behaviour is important for the design, implementation, and optimisation of storage subsystems. The TPC-H trace collection on which we conducted the analysis was collected on a Vertica 9.0.1 database cluster running on CentOS Linux 7.3 (installed on ext4 file system), five nodes, 2 × 10 cores CPU Intel® Xeon E5-2660v3 @ 2.66 GHz, 16 × 8 GB = 128 GB RAM, 6 × HDD 900 GB (@ 15K rpm), 2 × 1 Gb Ethernet, 2 × 10 Gb Ethernet, 2 × 16 Gb Fibre Channel Adapter. Due to the ratio of performance to price set by the client, we could not recommend faster disk I/O technology.
As introduced, the TPC-H can also be used as a metric to reflect on multiple aspects of a NoSQL Vertica database system's ability to process queries. The aspects of performance improvements for different database sizes and system expansion are captured collectively in Table 4 and Figures 9-11. As such, it is possible to infer anticipated needs for future system upgrades and expected performance based on evidence from measured performance improvements from three to five nodes tested on 1 TB and 3 TB databases.
Figure 10.
The total sum of the 22 TPC-H test query execution times on three to five nodes in the Vertica cluster for the 1 TB test database. Performance improvements are noticeable after the second run due to database restart only and after horizontal scaling with additional nodes.

Figure 11.
Performance improvement comparing three to five nodes in the Vertica cluster for 1 TB database size.
In comparing performance improvement and scalability perspectives, the results show a performance increase of at least 25% from three to five nodes (Figure 11) on the 1 TB database size utilising low-cost commodity hardware.
Query execution times and performance improvements achieved by adding extra computing resources provide sufficient evidence that the scaled-out design will work with larger datasets in the future.
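One way to express such an improvement is as the relative reduction in total query execution time between two cluster sizes. The text does not state the exact formula used, so the definition below is an assumption, and the timings in the usage example are illustrative placeholders rather than measured results.

```python
def improvement_pct(baseline_seconds: float, scaled_seconds: float) -> float:
    """Relative reduction in total execution time, in percent (assumed definition)."""
    if baseline_seconds <= 0 or scaled_seconds <= 0:
        raise ValueError("execution times must be positive")
    return (baseline_seconds - scaled_seconds) / baseline_seconds * 100.0


def speedup(baseline_seconds: float, scaled_seconds: float) -> float:
    """Ratio of baseline time to scaled-out time."""
    if baseline_seconds <= 0 or scaled_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / scaled_seconds


if __name__ == "__main__":
    # Hypothetical totals for the 22 queries; not taken from Table 4.
    three_nodes, five_nodes = 4000.0, 3000.0
    print(f"improvement: {improvement_pct(three_nodes, five_nodes):.1f}%")
    print(f"speedup: {speedup(three_nodes, five_nodes):.2f}x")
```

With these placeholder totals, a drop from 4000 s to 3000 s corresponds to a 25% improvement.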

Discussion
The use of big data technology intended to advance a healthcare eSystem can be evaluated in terms of achieved performance, privacy, security, interoperability, compliance, costs, and future proofing such as scalability to incremental hardware integrations, analytical tools, and data increase. In the case of the Czech Republic national tender (Id. VZ0036628, No. Z2017-035520), vendor-independent solutions had to meet a large number of requirements encompassing all of the above-mentioned criteria intended to modernise the national healthcare system within the European Union. Due to contractual obligations with IHIS, as a participating party, we were unable to obtain or to disseminate competitors' details, including their system performance benchmarks or other proposed BDA platform architecture. However, our contract permits dissemination of the results and authorship for PoC before handover to the IHIS. The presented BDA solution accepted by the Czech Republic has met all the requirements and has demonstrated system performance results well exceeding the required thresholds.
Concepts and insights transferrable to other healthcare systems are based on this case study and on the consensus of experts' views, reported literature, and existing knowledge available in the public domain. The authors' views and vision for big data in future healthcare eSystems are based on professional experience, findings from Vertica-based eSystem development, and big data concepts. As such, we wish to emphasise the importance of scalability for future data and performance increases, accommodation of near-future machine learning algorithms and analytical tools, security, and strategic healthcare planning. Therefore, looking beyond the primary scope of this project, we question what the implications for healthcare and other big data industry professionals are. For a start, the Vertica BDA platform runs on Amazon, Azure, Google, and VMware clouds, providing user agility and extensibility to quickly deploy, customise, and integrate a variety of software tools. Vertica enables data warehouse transition to the cloud and on-premises, providing flexibility to start small and grow along with the customer's business requirements. In this case, our client (IHIS) set the conditions for implementation of the proposed solution according to the on-premises principle. The solution had to be physically isolated from the Internet and it was not possible to propose a cloud-based solution.
Nevertheless, Vertica provides end-to-end security with support for industry-standard protocols, so we believe that the future of infrastructure will evolve as a multi-cloud and hybrid solution, i.e., as a mixture of on-premises and cloud environments. Such data analytics and management approaches are not meant to be restricted to one type of environment only. For example, Vertica announced the availability of Eon Mode for Pure Storage (https://www.vertica.com/purestorage/) as the industry's first analytical database solution with a separation of computing and storage architecture for on-premises workload distribution.
Other available big data scalable technologies and frameworks [69][70][71][72][73] include Hadoop, an open-source ecosystem with its own distributed file system (HDFS) and the Java-based MapReduce framework for storing and batch processing large amounts of data. Apache Spark is also designed to fit well within big data ecosystems. Apache Spark, for example, is known for keeping large amounts of data in memory as Resilient Distributed Datasets (RDDs) and providing better computing performance than Hadoop (by a factor of ten to one hundred). However, the Apache Spark in-memory computing engine does not provide key-value storage within its framework, as Hadoop on HDFS or NoSQL databases do.
Apache Spark and NoSQL databases are often coupled in one ecosystem on top of a Hadoop installation. Considering Apache Spark with Hadoop ecosystems, there are overheads and delays associated with the data movement. Furthermore, such ecosystems require extra administrative efforts, particularly in cases of separate clusters and data duplication.
As a part of the presented solution, the open-source product Talend Open Studio (version 6.4) was used for data integration and extract, transform, and load (ETL) across various data sources (including file systems, Hadoop, NoSQL, RDBMS) in batch or real-time processing fashion. The recommended operating system for the Vertica BDA platform is CentOS Linux 7.3. Vertica also has support for other Linux-based operating systems, such as (in order of authors' preference): Red Hat Enterprise Linux (RHEL) 7.3, Oracle Enterprise Linux (OEL) 7.3, SUSE 12 SP2, Debian 8.5, and Ubuntu 14.04 LTS. For our eSystem implemented on IHIS premises, we additionally installed open-source software Nagios Core (version 4.1) for network infrastructure and cluster monitoring purposes.
Regarding plans for our BDA solution in 2021, we are considering proposing further improvements to national healthcare and privacy protection by stream data processing from health IoT devices and mobile apps (including wearable devices such as smart watch sensors). Currently, we are conducting tests in a development environment expanded by another platform's components (Eclipse Mosquitto open source broker for carrying out stream data from IoT devices by using MQTT protocol). Acquired test data from IoT devices are transferred as stream data via MQTT Mosquitto broker (https://mosquitto.org), transformed using Apache Spark (https://spark.apache.org) and stored for future data operation purposes in Hadoop. From that layer, data are further processed in a Vertica NoSQL cluster. For IoT platform management purposes, we are using Node.js (https://nodejs.org/) to build fast and scalable network applications and the Angular platform (https://angular.io) for building mobile and desktop applications.

Conclusions
The growing volume of medical records and data generated from near-future IoT and mobile devices mandates the adoption of big data analytics (BDA) in healthcare and related contexts. As part of the national strategy for BDA adoption in healthcare, the Czech Republic healthcare institute (IHIS) has aligned its strategy with the European Union. The national public tender comprised over 100 complex requirements, in line with statutory regulations, including the subset of criteria reported here regarding performance, cost-effectiveness, robustness, and fault tolerance. Such a BDA solution, running on Linux-based open-source software (e.g., Talend Open Studio, Python, R, Java, Scala environments), had to be capable of achieving competitive and above-threshold results in the overall system performance evaluation, based on the TPC-H industry-standard decision support benchmark.
The tender-winning BDA solution reported here represents a snapshot in time, which exceeded expected operation on healthcare-specific TPC-H benchmark tests. The BDA solution and its control were transferred to IHIS, which over the past seven months has unified the isolated healthcare systems into one eSystem. In addition to demonstrated tests and real-life performance, the current eSystem has great potential to improve national healthcare in the Czech Republic, as well as to accommodate evolving expectations and future data needs. The produced eSystem based on Vertica analytic database management software is future-proofed in terms of stream and high-volume processing, scalability (based on consumer/commodity hardware) and fault tolerance (e.g., shutting down cluster nodes would not cause data loss). Horizontal scalability tests using commodity hardware demonstrate a performance improvement of over 25% by increasing the number of cluster nodes from three to five, providing sufficient evidence of a scaled-out design based on cost-effective commodity hardware.
Currently, the produced BDA healthcare eSystem is physically isolated from the Internet infrastructure by being installed in an on-premises mode within the national geographical boundaries and therefore is considered highly secure, supporting industry standards regarding data security and protocols. The BDA healthcare eSystem supports a variety of open-source software, including various Linux distributions with a growing number of machine-learning libraries and integration of commercial tools such as Tableau. In light of the recent Coronavirus outbreak, the presented eSystem provides regional data geo-mapping visualisation of the Czech Republic with updates in under 10 s and can exchange data with other healthcare eSystems. In addition to data integration, the geo-mapping functionality provides near real-time pandemic/epidemic tracking, outbreak spread monitoring, and risk data visualisation.
The next steps in the future development of the presented healthcare BDA platform include: (1) BDA platform extensions supporting medical IoT and mobile apps data streaming so that the existing solution remains 'the blueprint' architecture; (2) support for data-driven decisions during high-traffic events; (3) ongoing horizontal scaling and an increase from 100 TB to 1 PB (Petabyte) processing capability; (4) new approaches to data cleaning, storing, and retrieval with minimal latency; (5) integration with other national registers (e.g., to manage and facilitate drug distribution logistics); and (6) strategic planning using healthcare data.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The Appendix section refers to supplementary data and ERDs (Figures A1 and A2). TPC-H benchmarks for the PoC testing phase were performed on data provided by IHIS.
The input data structure represents IHIS data for initial load for both patients' and electronic healthcare records (EHR) data warehouses:
• Multiple compressed warehouse image archives are also included in a CSV semicolon-delimited format.
• File sizes for import into the NoSQL database ranged from 100 GB up to 10 TB.
The Insurance company dataset contains records of three insurance companies:
• Three standard quarterly datasets, which have been processed on a quarterly basis for each insurance company, as well as one corrective dataset (simulating a situation where one insurance company supplied bad data).
• Standard input data size is less than 1 TB per packet of data.
• Each data package is in a ZIP format with metadata included in the package name as insurance company and package serial number.
Each compressed package contains the following files:
• The Anatomical Therapeutic Chemical Classification System (ATC).
• Code lists and performance group, which are separated in the files to assign attributes of dynamically loaded dimensions.

Figure A2.
ER model of data warehouse for patient's health record (part II). The provided figure is intended for online viewing in high resolution.