Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark

Abstract: One of the most important tasks of any platform for big data processing is storing the data received. Different systems have different requirements for the storage formats of big data, which raises the problem of choosing the optimal data storage format to solve the current problem. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats and a methodology for choosing the format. The following data storage formats will be considered: avro, CSV, JSON, ORC, parquet. At the first stage, a comparative analysis of the main characteristics of the studied formats was carried out; at the second stage, an experimental evaluation of these formats was prepared and carried out. For the experiment, an experimental stand was deployed with tools for processing big data installed on it. The aim of the experiment was to find out characteristics of data storage formats, such as the volume and processing speed for different operations using the Apache Spark framework. In addition, within the study, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is presented in the form of a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.


Introduction
The development of technologies that work with data has contributed to the emergence of various tools for big data processing [1]. Big data refers to volumes of information, collected from various sources, for which processing by traditional methods becomes very difficult or impossible [2,3]. At the same time, most researchers agree that big data can be understood not only through its volume, but also through its ability to be a source of valuable information and ideas [4].
The development of platforms for analytical data processing has become a popular direction in the field of working with big data [5]. Such platforms are designed not only for processing, but also for storing data. The best known among such platforms is Apache Hadoop [6]. Hadoop is a set of software utilities [7], the core of which is a distributed file system that stores data in certain formats, and a data processor that implements the MapReduce processing model [8].
However, due to various limitations of this system, new big data processing systems were developed (e.g., Hive [9], Impala [10], Apache Spark [11], etc.). These tools are, on the one hand, independent products and, on the other hand, additional tools for the Apache Hadoop system. Such tools provide a convenient language for data selection. In addition, frameworks such as Apache Spark [12] can work with a variety of file formats. For this study, five formats supported by this framework were selected: avro, CSV, JSON, ORC, and parquet. The aim of the paper is to study the features of file formats used for storing big data, as well as to conduct an experimental evaluation of the formats.
However, when developing system architectures based on Apache Hadoop, the question of choosing the optimal data storage format may arise.
This paper describes the five most popular formats for storing big data in the Apache Hadoop system, presents an experimental evaluation of these formats and tropical optimization methods for choosing an effective solution.
Tropical (or idempotent) mathematics is an area of applied mathematics that studies the theory and applications of semirings with idempotent addition.
The use of models and methods of tropical algebra allows reducing several nonlinear problems to a linear form in terms of an idempotent semiring or semifield. In other words, transforming the original problem into a tropical optimization problem with idempotent operations leaves the optimality properties of the original and the reduced problems invariant; in this sense, it is a symmetric transformation. This approach simplifies the interpretation of the results and finds application in solving practical problems of planning, placement, and decision-making.
One of the directions of tropical mathematics is the development of methods for solving optimization problems that can be formulated and solved in terms of idempotent mathematics (tropical optimization problems). The theory is based on the correspondence between constructions over the field of real numbers and similar constructions related to various idempotent semirings. For example, [13] describes the solution of problems that reduce to finding the best approximate solution, in the sense of the Chebyshev metric, of a vector linear equation in which the product is understood in the sense of tropical algebra.
The study is aimed at developing a technique for determining the most effective data storage format under the given conditions of big data use.
The article is organized as follows. Section 2 provides background regarding the problem of choosing software components. Section 3 describes the details of the experimental setup. The configuration of the hardware, software, and the data preparation method is given, and the results of the experiment are described. Section 4 presents mathematical methods for choosing the solution based on the tropical optimization theory. Section 5 presents the discussion. The Conclusion presents the results obtained during the study.

Background
The problem of choosing software components has been studied by various authors [14-24]. The papers present a selection of various components of the system, as well as methods for the experimental evaluation of the selected components. For example, [14,15] present a methodology for choosing libraries for software development using methods of evolutionary calculus. In [16-18], an experimental assessment of integration messaging systems is presented. These papers present a methodology for studying the data transfer rate in such systems. However, the authors do not give recommendations on the choice of an integration messaging system.
Papers [19-24] present studies of big data storage formats, such as avro, parquet, orc, etc. These studies report the results of examining different formats in terms of performance, or of choosing an alternative for specific purposes. For example, the authors in [23] study data storage formats for storing data in web systems or data for research in bioinformatics. These are highly specialized studies for specific tasks. Study [24] addresses a problem similar to the current study. However, that study covers only the avro and parquet storage formats and mentions other data storage formats only indirectly.
Symmetry 2021, 13, 195
However, the cited works do not investigate the issue of choosing a data storage format. Most of these studies provide the results of examining each format and recommendations on the choice for the problem under study. It should be noted that data storage can be carried out using relational databases. However, in recent years, NoSQL solutions have gained popularity [25,26], some of which support different data storage formats.
Using a sub-optimal format for storing big data can lead to various errors and difficulties when working with data. Thus, the use of formats that do not support complex data structures (such as arrays or dates) can lead to incorrect result sets when fetching data using SQL-like tools in a system, such as Hadoop. In addition, the use of formats that do not use data archiving or metadata can lead to an increase in data retrieval time. Therefore, for systems where the speed of analytical data processing is critical, a forced delay may occur. For systems that require platform independence, there may be a need for expensive system modifications. Different storage formats for big data affect a number of criteria for software satisfaction. These criteria include the speed of working with data (reading, writing, analytical processing, etc.), the speed of development and implementation, portability to different platforms, etc.
The current study is focused on developing techniques for selecting the optimal storage format for Apache Hadoop. The basis of the proposed technique is an experimental assessment of data storage formats and a mathematical model for choosing the optimal solution based on tropical optimization methods. To select the data format, the paper solves the problem of constructing an assessment system, a system of criteria and methods for obtaining their reliable numerical values based on experimental studies, as well as choosing and using an optimization method based on quality criteria. It should be noted that the study does not solve the issue of the functionality of the proposed formats, but reflects the feasibility of using them in the proposed conditions.

Method and Experiment
For the current study, the following data storage formats will be considered: avro, csv, json, orc, parquet.
Let us consider the features of the internal structure of the studied data storage formats.
Avro is a row-oriented data storage format. It contains a schema in the JSON format, which allows faster reading and interpretation operations [27]. The file structure consists of a header and data blocks [27]. Avro format supports primitive types, such as Boolean, int, long, float, etc., and complex types, such as array or map.
Comma-separated values (CSV) is a textual format describing data in the form of a table. A CSV file does not support different data types and structures; all data are presented as strings.
JavaScript Object Notation (JSON) is a simple text format. JSON has gained popularity for storing big data in document databases. JSON supports data types and structures such as string, number, Boolean, array, null, and nested object.
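The difference in type support between CSV and JSON described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than Spark, and the sample record is hypothetical; it shows the behavior of the text formats themselves, not of any particular big data reader.

```python
import csv
import io
import json

# Hypothetical record with a number, a Boolean, and an array.
record = {"id": 7, "active": True, "tags": ["a", "b"]}

# CSV round trip: every value comes back as a string.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# JSON round trip: numbers, Booleans, and arrays keep their types.
json_row = json.loads(json.dumps(record))

print(csv_row)
print(json_row)
```

After the CSV round trip, the consumer has to re-parse the strings and decide on the types itself, whereas the JSON round trip returns the original structure.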
Optimized Row Columnar (ORC) is a column-oriented storage format [28]. Data in ORC are strongly typed. ORC files share a common internal structure: they are divided into stripes that are independent of each other. ORC files contain metadata stored in compressed form, including statistical and descriptive information, indexes, and stripe and stream information. ORC supports a complete set of types, including complex types (structures, lists, maps, and unions) [29]. ORC also complies with ACID requirements by adding delta files.
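The distinction between a row-oriented layout (as in Avro) and a column-oriented layout (as in ORC) can be sketched in plain Python. This is only a conceptual illustration under the assumption that all records share the same fields; the helper `to_columns` and the sample records are hypothetical and do not reproduce the actual on-disk structure of either format.

```python
# Row-oriented layout: one complete record after another.
rows = [
    {"id": 1, "category": "a", "amount": 10.0},
    {"id": 2, "category": "b", "amount": 20.0},
    {"id": 3, "category": "a", "amount": 30.0},
]


def to_columns(records):
    """Regroup records into one list of values per field (column layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record has to be visited to project a single field.
amounts_from_rows = [row["amount"] for row in rows]

# Column layout: the values of the field are already stored together.
amounts_from_columns = columns["amount"]

total = sum(amounts_from_columns)
print(total)
```

Analytical queries that touch only a few fields benefit from the column layout, because the remaining fields do not have to be read at all.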
Apache Parquet is a column-oriented binary format. It allows defining compression schemes at the column level and adding new encodings as they appear [30]. Parquet supports simple (Boolean, int32, float, etc.) and complex (byte_array, map) data types. A Parquet file contains metadata written after the data itself, which allows the file to be written in a single pass. Table 1 contains a comparison of the described storage formats.
To evaluate the data storage formats, the technique described below was developed. The technique consists of two parts:

1. Experimental evaluation of the studied data storage formats.
2. Analysis of Spark data processing functions using different storage formats.

Experimental Evaluation
The first stage in the study was to conduct an experimental evaluation of these formats.
The experimental evaluation consisted of simulated processing of the dataset. An experimental stand was deployed for testing. For the study, a dataset of 10 million records was generated. Appendix A contains the experimental resources. Figure 1 illustrates the experiment schema. The host file system contains the generated dataset. A Java virtual machine, which runs the Spark application executor (driver), is installed on the host. After the Spark application starts, the Spark context is created, the Spark application reads the storage files, and the operation under study is performed. Since Spark uses lazy evaluation [31], an operation is considered complete when the count of the records in the resulting dataset is received.
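The reason for ending the measurement at the record count can be illustrated with a plain-Python analogy. The sketch below does not use the Spark API; a generator stands in for a lazy transformation and counting stands in for the action. The names `lazy_filter` and `timed_count` are hypothetical.

```python
import time


def lazy_filter(records, predicate):
    """Lazy transformation: returns a generator, nothing is evaluated yet."""
    return (record for record in records if predicate(record))


def timed_count(lazy_result):
    """Action: force evaluation by counting the records and time it."""
    start = time.perf_counter()
    count = sum(1 for _ in lazy_result)
    elapsed = time.perf_counter() - start
    return count, elapsed


evaluated = []  # records every call of the predicate


def is_even(number):
    evaluated.append(number)
    return number % 2 == 0


pending = lazy_filter(range(10), is_even)
calls_before_action = len(evaluated)  # the transformation has not run yet

count, elapsed = timed_count(pending)
print(count, elapsed)
```

Timing only the definition of the transformation would measure almost nothing; the work happens when the count is requested, which is why the count is used as the completion point.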
For each data format, a study was conducted, consisting of test runs of the Spark application performing the same set of operations. The following measurements were made.
The total size of the dataset. One of the most important characteristics of data is its volume. Since volume becomes critical in systems for processing and storing big data, it is necessary to find a format that can store data with a minimum volume.
Reading all lines. The most important parameter in data processing and analysis is the time to read the data. In this test, the time taken to read all records was measured.
Filtering data. Data filtering is one of the most frequently used operations in data processing and analysis.
Search for unique strings. An equally important operation in data processing and analysis is the search for unique records.
Sorting. Sorting is the most complex operation, both in design and in databases, so the results of this test are important when analyzing big data storage formats.
Grouping. Grouping is also one of the most used operations in data analysis and processing.
However, the results obtained cannot be considered final, since any data processing and storage system is constantly updated with new data. To study the rate of change in processing time when working with different formats, three more datasets were generated with a similar structure: 5 million records, 25 million records, and 50 million records.
For each of the obtained datasets, the operations described earlier were also carried out. Each of the obtained values was used to calculate the rate of change in the file processing time. The rate was calculated using a formula in which $duration_i$ is the operation duration for the $i$-th dataset.
Below are graphs of the results of calculating the rate of change in the processing time of files of different formats, by operation. The Y-axis in Figures 8-12 shows the rate calculated for datasets of different volumes.
As presented, there is an anomaly in the sorting, filtering, and grouping operations in the form of a slight change in the processing time of the files. This requires studying the algorithm for processing these formats for hidden functions built into the Apache Spark framework that affect such changes.
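The rate of change in processing time discussed above can be computed with a few lines of code. Since the exact formula is not reproduced here, the sketch below rests on an assumption: each duration is expressed as a percentage of the duration measured on the smallest dataset. The function name `rate_of_change` and the sample durations are hypothetical and are not measurements from the experiment.

```python
def rate_of_change(durations):
    """Express each duration as a percentage of the first (baseline) one.

    Assumption: the first element is the run on the smallest dataset.
    """
    if not durations:
        raise ValueError("at least one duration is required")
    baseline = durations[0]
    if baseline <= 0:
        raise ValueError("baseline duration must be positive")
    return [100.0 * duration / baseline for duration in durations]


# Hypothetical durations in seconds for datasets of increasing volume.
sample_durations = [2.0, 4.0, 9.0, 20.0]
rates = rate_of_change(sample_durations)
print(rates)
```

A format whose rate grows more slowly than the data volume scales better for the operation in question; a nearly flat rate corresponds to the kind of anomaly noted above.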

Analysis of the Spark Algorithm
To further compare the storage formats of big data, let us analyze the algorithms used by the framework for each operation for each data storage format. It should be understood how the framework works with each data storage format. To obtain statistical information on the operation of the algorithm, the Spark Web User Interface tool built into the main framework [32] was used, which collects information about the operation of the application.
As stated earlier, the Spark framework supports lazy evaluation. Therefore, to analyze the algorithms, two operations were performed: transformation, represented by the operation under study, and action, which is the operation of counting the number of objects in the dataset.
The following were chosen as the main metrics: stage count; task count at each stage; shuffle spill (memory/disk) at each stage; median value statistics.
For example, consider the following three operations: searching for unique objects; data filtering; sorting.
Search for unique objects. Figure 13 shows an algorithm that is common to all data storage formats. As can be seen in the figure, the algorithm consists of three stages. Each stage consists of code generation (WholeStageCodegen) and a shuffle stage (Exchange). Appendix B contains the detailed characteristics obtained for this operation.
Data filtering. The data filtering algorithm consists of two stages. Figure 14 shows the schema of this algorithm. Appendix B contains the detailed characteristics obtained for this operation.
Sorting. Sorting, unlike the previous two operations, consists of two jobs, each of which consists of one or more stages. Figures 15 and 16 show the algorithm for working with files in the first and second job, respectively. Appendix B contains the detailed characteristics obtained for this operation.
Following the above analysis, the data processing algorithm is the same for each storage format.
The differences in the results obtained are insignificant, which means that the results obtained during the experiment are typical for the presented data storage formats. Thus, the framework does not significantly affect the conduct of the experiment.

Results
The problem of choosing the optimal format was presented in the form of several optimization tasks using the tropical optimization algorithm [33,34].
The aim of the algorithm is to calculate the rating vector of the alternatives presented. It consists of the following stages:
1. Paired comparison of the alternatives under each criterion;
2. The comparison of the criteria themselves;
3. Optimization task solution.
It should be noted that the comparison of alternatives is based on the required task. Appendix C contains the tables of paired comparison of the criteria and alternatives.
There are no special rules for evaluating alternatives. Each researcher has the right to determine the rules for the comparative assessment of alternatives. In addition, the assessment of alternatives and criteria depends on the tasks assigned to the researcher.
It is important to know the following features of compiling comparison matrices: $a_{ij}$ describes the degree of preference for alternative $i$ over alternative $j$; $a_{ij} = a_{ji}^{-1}$.
As part of the current research, the following methodology for evaluating ratings was developed. It consists of the following rules for choosing preferences:

1. Platform independence is not the most important characteristic, because the study aims to find the optimal file format for the Apache Hadoop system.
2. The ability to record complex structures has an important role, since it provides great opportunities for data processing and analysis.
3. The ability to modify data is not critical, since most big data storage platforms comply with the "write once, read many" principle.
4. The possibility of compression has an indirect role, since it affects the volume of data.
5. The presence of metadata is an indicator that does not require separate analysis, because it is already reflected in the speed of reading and grouping data.
According to the experimental results, the following rules were formulated:
1. The data volume plays an important role in the processing and storage of big data, but is not critical, since storage hardware has become much cheaper in recent years.
2. Reading all lines is an important indicator, since it most fully reflects the speed of data processing using a particular data storage format.
3. Filtering and the search for unique values are equally important characteristics; however, these functions rely on reading all lines, the importance of which is defined in the previous item.
4. Applying a function, grouping, and finding the minimum value are the next most important indicators, since they are more interesting from the point of view of analytics than engineering.
5. Sorting is the least important of the criteria presented, as it is most often used to visualize data.
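The property $a_{ij} = a_{ji}^{-1}$ of the comparison matrices mentioned earlier means that only the judgments for $i < j$ need to be specified. A minimal sketch of building such a matrix is given below; the function names and the sample judgments are hypothetical and do not reproduce the matrices of Appendix C.

```python
from fractions import Fraction


def comparison_matrix(size, upper):
    """Build a paired comparison matrix from judgments given for i < j.

    `upper` maps (i, j) with i < j to the preference of i over j.
    """
    matrix = [[Fraction(1)] * size for _ in range(size)]
    for (i, j), value in upper.items():
        if not 0 <= i < j < size:
            raise ValueError(f"expected 0 <= i < j < {size}, got {(i, j)}")
        value = Fraction(value)
        if value <= 0:
            raise ValueError("preferences must be positive")
        matrix[i][j] = value
        matrix[j][i] = 1 / value  # reciprocal entry: a_ji = 1 / a_ij
    return matrix


def is_reciprocal(matrix):
    """Check a_ij * a_ji = 1 for all pairs of indices."""
    size = len(matrix)
    return all(matrix[i][j] * matrix[j][i] == 1
               for i in range(size) for j in range(size))


# Hypothetical judgments for three alternatives.
judgments = {(0, 1): 3, (0, 2): 5, (1, 2): 2}
A = comparison_matrix(3, judgments)
print(A)
```

Exact fractions are used here so that the reciprocity check does not depend on floating-point rounding.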
To assess the preference of one or another indicator, a preference scale is introduced.
Before describing the algorithm, it is necessary to introduce the basic definitions of tropical algebra [33].
Consider the set of positive real numbers $\mathbb{R}_+$, on which two operations are defined: idempotent addition $\oplus$ with the neutral element 0, the result of which is the maximum of the terms, and multiplication $\otimes$ with the neutral element 1 (defined as usual). For each element $x$ of the set, an inverse element $x^{-1}$ is defined such that $x x^{-1} = x^{-1} x = 1$. The resulting system is called an idempotent semifield.
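The scalar operations of this semifield are simple enough to state directly in code. The sketch below is an illustration only; the names `oplus`, `otimes`, and `inverse` are hypothetical.

```python
ZERO = 0.0  # neutral element of idempotent addition
ONE = 1.0   # neutral element of multiplication


def oplus(x, y):
    """Idempotent addition: the maximum of the two terms."""
    return max(x, y)


def otimes(x, y):
    """Multiplication, defined as usual."""
    return x * y


def inverse(x):
    """Multiplicative inverse of a positive element."""
    if x <= 0:
        raise ValueError("only positive elements have an inverse")
    return 1 / x


x = 4.0
idempotent = oplus(x, x) == x
has_inverse = otimes(x, inverse(x)) == ONE
# Multiplication distributes over the maximum for non-negative values.
distributive = otimes(2.0, oplus(3.0, 5.0)) == oplus(otimes(2.0, 3.0),
                                                     otimes(2.0, 5.0))
print(idempotent, has_inverse, distributive)
```

Idempotency ($x \oplus x = x$) is the property that distinguishes this structure from ordinary arithmetic.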
Matrices over an idempotent semifield are defined in the usual way. The algorithm uses three matrix notions introduced in [33]: the trace, the tropical spectral radius (a scalar), and the asterate operator.
Using the given matrices, calculate the rating vector of alternatives [34]. The algorithm for calculating the rating vector of alternatives consists of the following steps:
1. From the criteria matrix $C$, calculate the weight vector of criteria.
2. If the result contains more than one vector (up to a positive factor), find the least differentiating and the most differentiating weight vectors. Here $P$ is the matrix obtained from $(\mu^{-1}C)^*$ by removing the columns that are linearly dependent on the others, $P_{sk}$ is the matrix obtained from $P$ by setting every element except $p_{sk}$ to zero, and the indices $k$ and $s$ are calculated by a separate formula.
3. Using $w_1 = (w_i^{(1)})$ and $w_2 = (w_i^{(2)})$, calculate the weighted sums of the paired comparison matrices.
4. Calculate the least differentiating vector of the rating of alternatives. If the resulting vector is not unique, calculate it in a different way.
5. Calculate the most differentiating vector of the rating of alternatives. If the resulting vector is not unique, calculate it in a different way, using a matrix $Q$ from which the columns linearly dependent on the others have been removed; $Q_{sk}$ is the matrix obtained from $Q$ by setting every element except $q_{sk}$ to zero, and the indices $k$ and $s$ are calculated by a separate formula.
As an example, first calculate the spectral radius using the calculation rules of the idempotent semifield: $\mu = 1.5874$.
To find the least differentiating weight vector, let us first calculate the weight vectors. The resulting matrix contains two vectors; for the following calculations, only one of them is chosen, for example, the first one. Using this weight vector, the least differentiating vector of the rating of alternatives is calculated. Again taking only the first weight vector, the most differentiating vector of the rating of alternatives is calculated; the resulting vector looks similar to the previous one. According to this solution, the format rating is built as follows: the parquet and orc formats received the highest score in the ranking of alternatives, the avro and csv formats showed an average result, and JSON had the worst result.
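The matrix operations that the algorithm relies on can be sketched as follows. The definitions used are the ones common in the tropical algebra literature and are stated here as an assumption about the notation of [33,34]: the trace is $\mathrm{tr}\,A = a_{11} \oplus \dots \oplus a_{nn}$, the spectral radius is $\bigoplus_{m=1}^{n} \mathrm{tr}^{1/m}(A^m)$, and the asterate is $A^* = I \oplus A \oplus \dots \oplus A^{n-1}$. The criteria matrix below is hypothetical and is not the matrix of Appendix C; the sketch covers only the first step (the candidate weight vectors), not the full rating procedure.

```python
def t_matmul(a, b):
    """Max-times product: entry (i, j) is max over k of a[i][k] * b[k][j]."""
    return [[max(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def t_add(a, b):
    """Entrywise idempotent addition (maximum) of two matrices."""
    return [[max(x, y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def identity(size):
    """Tropical identity: ones on the diagonal, zeros elsewhere."""
    return [[1.0 if i == j else 0.0 for j in range(size)]
            for i in range(size)]


def trace(a):
    """Idempotent sum (maximum) of the diagonal entries."""
    return max(a[i][i] for i in range(len(a)))


def spectral_radius(a):
    """Maximum over m = 1..n of tr(A^m) ** (1 / m)."""
    power, best = a, trace(a)
    for m in range(2, len(a) + 1):
        power = t_matmul(power, a)
        best = max(best, trace(power) ** (1.0 / m))
    return best


def asterate(a):
    """I + A + ... + A^(n-1), computed with tropical sums and products."""
    size = len(a)
    result = identity(size)
    power = identity(size)
    for _ in range(size - 1):
        power = t_matmul(power, a)
        result = t_add(result, power)
    return result


# Hypothetical 3 x 3 criteria matrix with c_ij = 1 / c_ji.
C = [[1.0, 2.0, 0.5],
     [0.5, 1.0, 2.0],
     [2.0, 0.5, 1.0]]

mu = spectral_radius(C)
scaled = [[value / mu for value in row] for row in C]
weights = asterate(scaled)  # columns generate the candidate weight vectors
print(mu)
print(weights)
```

For this cyclic example the spectral radius is 2 (up to floating-point rounding) and every column of the result is the same vector of ones, so all three criteria receive equal weight. Matrices whose asterate has several non-proportional columns lead to the case of several weight vectors discussed above.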

Discussion
The presented study is an example of the application of experimental evaluation and tropical optimization methods to find the optimal data storage format when developing a data processing and storage system using the Apache Hadoop platform and the Apache Spark framework.
This study can be used to build data processing and storage systems based on the Apache Hadoop platform or similar solutions. In addition, it can be an example of a solution to similar problems when a selection from a list of alternatives is required. Such questions can arise both when choosing data storage formats and other tools and system components.
The resulting solution is based on the results of specific tests and does not reflect the popularity or functionality of the formats under consideration; it only reflects the expediency of using the formats in the conditions under consideration, namely the presence of big data and the use of the Apache Hadoop platform.
However, unlike other similar studies [14,17-19], this study solved the problem of choosing an effective solution by methods of tropical algebra based on matrices constructed from experimental parameter estimates. The use of the proposed approach made it possible to take into account several investigated parameters for evaluating data storage formats without introducing additional hypotheses about the priorities of the evaluation criteria. The use of tropical analysis tools and their symmetric properties during the transition to idempotent semirings made it possible to form an algorithm for choosing solutions, which allows extending its use to similar problems with other formats or other experimental methods.
For example, big data processing systems are cluster systems, which allow processing more data using several nodes connected to a computer network. In this study, a single node was used, and its results may differ from those obtained when processing a similar dataset on a cluster. Therefore, the authors plan to continue the experiment using clusters with different types of configuration and resources.
In addition, the dependence of the rate of change in processing time on the data volume should be studied for the different formats.

Conclusions
The paper presents a methodology for choosing a data storage format based on an experimental evaluation of the five most popular formats using the Apache Spark framework.
The study consisted of two parts: experimental and computational. In the experimental part, each format was evaluated based on several test runs. For the experiment, an experimental stand was deployed with tools for processing big data installed on it. The aim of the experiment was to find out characteristics of data storage formats, such as the volume and processing speed for different operations using the Apache Spark framework.
In the computational part, an algorithm for choosing alternatives using tropical optimization methods was presented, the essence of which is to solve a multi-criteria decision-making problem, the result of which is presented in the form of a vector of preference degrees.
The article also provides an example of assigning ratings for alternatives. The algorithm helps to find the optimal solution for the specific requirements of the system.
The contribution of the study (presented in this paper) is that a technique of choosing a data storage format has been developed, using the example of experimental assessment and the methods of tropical algebra. As an example, the formats supported by the Apache Hadoop system and the Apache Spark framework, as one of the most popular frameworks for processing big data, were used.
It should be noted that this study was not aimed at studying the functional features of the presented data storage formats. The main goal of the study is to build a rating of the presented alternatives based on their experimental assessment using the entered parameters necessary for solving a specific problem.
These techniques can be useful for practical use. In any company, when developing or using software, there is always a choice of which package or system to use. An important selection criterion is the exchange of data with other components, which are often determined by data formats. In the paper, for the considered use cases of big data, the choice of the best solution was made, which can be useful in a similar case. However, the methods and experimental studies and quality indicators, as well as the optimization algorithm, are described in sufficient detail, and can be used for similar tasks where the choice of format is important, while the conditions for using the formats and the set of alternative options may be different.
This study can be one example for solving similar problems, without introducing additional hypotheses concerning the priorities of the evaluation criteria.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A Experimental Resources
To carry out the experimental evaluation, an experimental stand was built. The stand configuration is presented in Table A1. Table A2 contains the description of the data generated for the study.

Appendix B Statistics of the Operations Performed
The statistics obtained during the experimental evaluation are presented below. Tables A3-A5 show the comparative characteristics at each stage of the operation of searching for unique objects. Tables A8-A11 describe each stage of the sorting operation.

Appendix C Matrices of Alternatives Comparisons
To assess the alternatives, matrices of comparison of criteria and alternatives were compiled for each criterion. Table A12 describes the criteria preference matrix. Tables A13-A20 describe the matrices of the alternatives according to each criterion.