Performance Evaluation of an Independent Time Optimized Infrastructure for Big Data Analytics that Maintains Symmetry

Abstract: Traditional data analytics tools are designed to deal with the asymmetrical type of data.


Introduction
The term Big Data refers to a volume of data that is huge and still growing exponentially with time. Such large and complex data is too difficult to process and manage effectively with the help of traditional data management tools. Big data offers novel value, which originated out of the necessity for large firms such as Yahoo, Google, and Facebook to evaluate enormous amounts of information [1]. Due to the revolution in technology, millions of people now produce large amounts of data every day. Big data can be structured, semi-structured, or unstructured, which creates hurdles in handling it. Solutions to this problem include Hadoop, Hadoop with R, and Hadoop on Spark, which enhance parallelism and scalability [12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. Data handling (capturing and storing enormous information) has received critical consideration in recent years, for example through the MapReduce model. Nowadays, the necessity for further development of platforms is realized, platforms that can harness these innovations to gain significant understanding and settle on knowledgeable commercial decisions. The usage of data analytics on such datasets is normally termed big data analytics. Data mining and AI strategies are now utilized over a wide scope of enterprises to help organizations improve their business, reduce risks, and increase productivity. Areas utilizing these strategies include vendors, banking establishments, insurance agencies, and health-related fields [27,28]. In the present marketplace, data analytics has turned into a business prerequisite for many organizations hoping to gain a viable lead through virtualization and distributed computing [29]. This has been significantly perceived in the technical support space of leading multinationals. Call centers have considered the use of information applications as an approach to streamlining the business and adding knowledge and value to clients' expectations, a necessity in an industry tested by financial pressures and prolonged challenges [30].
The requirement for proficient, scale-out responses to handle component failures and give data consistency motivated the development of the Google File System (GFS) [31] and the MapReduce [32] model in the mid-2000s. The idea behind the Google File System and MapReduce is to distribute data over commodity servers in such a way that the computation is performed where the information is stored. This methodology dispenses with the need to move the data over the network to be processed. Moreover, strategies for assuring the flexibility of the cluster and load balancing of processing were specified. The GFS and MapReduce structures are the basis for the Apache Hadoop project, involving two principal parts: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce [33].
HDFS is the storage element of Hadoop, with participating nodes operating in a master/slave manner. All data captured in HDFS [34] is divided into chunks, which are replicated and distributed across various slave nodes on the cluster, recognized as Data-Nodes, with a master node, recognized as the Name-Node, keeping metadata such as which chunks form which files and where in the cluster these chunks are found.
MapReduce [35] is the programming model of Hadoop and is governed by a software daemon recognized as the Job Tracker. A job is a MapReduce program that includes the execution of Map and Reduce functions over a dataset. The MapReduce programming model also depends on a master/slave design. The Job Tracker runs on the master node and allocates Map and Reduce tasks to the slave nodes in the cluster. The slave nodes run a different software daemon, termed the Task Tracker, that is liable for starting up the Map and Reduce functions and reporting the progress back to the Job Tracker. The extended Hadoop ecosystem embraces a growing list of projects that integrate with or enlarge Hadoop's competences, such as the Mahout machine learning library (providing collaborative filtering, classification, and clustering algorithms), which is an open-source tool able to run on top of Hadoop to deliver distributed analytics abilities [36]. HBase, which is a distributed database, gives real-time read/write ability for datasets that reside in HDFS [37]. Hive is a query layer that translates HiveQL queries into MapReduce tasks for execution on a cluster [38]. Some studies have reported machine learning experiments, such as Mahout k-means clustering used on a 1.1 GB data set for scalability quality assessment [39]; system performance evaluated with clustering algorithms on an 11 GB Wikipedia data set [40]; and an evaluation of the performance of clustering algorithms over the Hadoop environment using the 1987 Reuters dataset [41]. A further significant contribution, Discovery Information using Community detection (DICO) for online social networking to provide cyber security, has been discussed by various workers [42][43][44].
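To make the Map and Reduce roles concrete, the following is a minimal word-count sketch in the Hadoop Streaming style, where the mapper emits key-value pairs and the reducer aggregates them. The script name and the use of Hadoop Streaming here are illustrative assumptions, not part of the evaluated setup.

```python
#!/usr/bin/env python3
# wordcount_streaming.py -- illustrative Hadoop Streaming job (mapper + reducer).
# Assumes tab-separated key/value lines on stdin/stdout, the Streaming contract.
import sys
import itertools

def mapper(lines):
    """Map stage: turn raw text into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce stage: pairs arrive sorted by key; sum the counts per word."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    if role == "map":
        for word, count in mapper(sys.stdin):
            print(f"{word}\t{count}")
    else:  # "reduce": stdin lines are "word\tcount", already sorted by the shuffle
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        typed = ((word, int(count)) for word, count in pairs)
        for word, total in reducer(typed):
            print(f"{word}\t{total}")
```

Hadoop Streaming would invoke such a script once per input split for the map role and once per partition for the reduce role, with the shuffle phase sorting mapper output by key in between, which is exactly the Job Tracker/Task Tracker division of labor described above.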
In the present era, traditional distributed techniques are not capable of storing and capturing such data because of the limited scalability of the environment. On the other hand, relational databases have rigid schemas, and data warehouses are not able to process the entire data due to its huge size. Due to the above limitations, big data requires a novel model that is flexible and performs parallel processing efficiently. In the present paper, we propose a model (A Three Master Node (Name-Node) Model), which is platform-independent for big data analytics and delivers faster results in comparison to traditional systems. Performance tests (response time of the system, execution time (time taken by the MapReduce programming model), and throughput) between the proposed model and the legacy (existing) model are done on three different data sets. These sets have different behaviors and use diverse algorithms, namely K-Means clustering, Naïve Bayes, and the Recommender system, for inspection of the model's application capabilities.

Proposed Hybrid Framework: A Three Master Node (Name Node) Model
In the present proposed model (Figure 1), we attached three master nodes (Name-Nodes), three Data-Nodes, and one client node, which works in the demilitarized zone (called an edge node); i.e., the Resource Manager runs on the master node, and the Data-Node services, application master, and node manager run on the Data-Node. Different tools are used for machine learning to make the model suitable for any kind of data processing. The analytical tools are put on the master nodes (all the master nodes share their respective resources with each other) to keep the services as well as data movement in symmetrical synchronization. Mahout, R, and Splunk are installed on the system. Splunk is the log analytical tool, which takes the logs of the system and helps the user analyze them for errors or any kind of security or data breaches. Thus, in the proposed model, there is no restriction on the Java Virtual Machine (JVM), as it depends on the number of sessions created by the user on the Name-Node, which actually works for user queries (read and write). Each write and read will create a JVM, which is the application master container in our environment. There is no compulsion that only one application master can run on a Data-Node. This means more than one JVM can run on a single Data-Node as per the data and the user requests.
The present model includes the concept of high availability, which means that if one master server, the active Name-Node, goes down or experiences a load so high that it is unable to work, then the standby Name-Node automatically changes its state to active and starts to serve user requests. This happens because of a daemon called the secondary Name-Node, which is basically the service that keeps track of edit logs and file system images (FS image); whenever the Name-Node goes down, it provides the latest logs to help the standby Name-Node become active.
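As an illustration of this failover behavior, the following is a minimal sketch, assuming a simplified cluster in which the secondary Name-Node only archives edit logs and the standby is promoted when heartbeats stop; the node names, heartbeat threshold, and in-memory log store are hypothetical.

```python
# ha_failover_sketch.py -- simplified simulation of Name-Node failover
# (illustrative only; real HDFS HA uses persisted edit logs and a checkpointed FS image).
import time

class SecondaryNameNode:
    """Keeps track of edit logs and the FS image so a standby can catch up."""
    def __init__(self):
        self.edit_logs = []   # hypothetical in-memory stand-in for persisted logs
        self.fs_image = {}

    def checkpoint(self, edits):
        self.edit_logs.extend(edits)
        for path, meta in edits:
            self.fs_image[path] = meta

class NameNode:
    def __init__(self, name, state="standby"):
        self.name = name
        self.state = state    # "active" or "standby"
        self.last_heartbeat = time.time()

def failover(active, standby, secondary, timeout=3.0):
    """Promote the standby if the active Name-Node has missed heartbeats."""
    if time.time() - active.last_heartbeat > timeout:
        active.state = "down"
        standby.state = "active"
        # The standby replays the latest edit logs from the secondary Name-Node
        # so its namespace matches the failed node's last known state.
        return secondary.fs_image, secondary.edit_logs
    return None

# Usage: nn1 stops heartbeating, so nn2 takes over with the checkpointed state.
nn1, nn2, snn = NameNode("nn1", "active"), NameNode("nn2"), SecondaryNameNode()
snn.checkpoint([("/data/set1", {"blocks": 72, "replication": 3})])
nn1.last_heartbeat -= 10   # simulate a crashed active node
print(failover(nn1, nn2, snn), nn2.state)
```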
All the master services are hosted on the master servers, and the three Data-Nodes are the nodes where the real data resides. In this experiment, the replication factor is 3, and the block size is configured as 128 MB. This value helps to run the job (a Hadoop job or user query is called an application or job in Hadoop) in an efficient manner, as the default size of a block is 64 MB. A block is the storage unit in which data is kept on HDFS; data exists in the form of data blocks in the cluster. All the nodes, i.e., the three master nodes and three Data-Nodes, share a common HDFS. HDFS is a distributed file system designed to run on commodity hardware; it is highly fault-tolerant and designed and developed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS enables symmetrical streaming access to file system data.
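With these settings, each 9 GB dataset used later in the experiments splits into 72 blocks of 128 MB, each stored three times. The following minimal sketch works that arithmetic out; the figures follow from the configuration above, while the helper name is our own.

```python
import math

def hdfs_footprint(data_size_mb, block_size_mb=128, replication=3):
    """Number of HDFS blocks (and hence map tasks) plus raw storage for a file."""
    blocks = math.ceil(data_size_mb / block_size_mb)   # one input split per block
    raw_storage_mb = blocks * block_size_mb * replication
    return blocks, raw_storage_mb

# Each 9 GB (9216 MB) dataset used in the experiments:
blocks, raw = hdfs_footprint(9216)
print(blocks, raw)   # 72 blocks -> 72 mappers, 27648 MB stored across the cluster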
According to the working scenario of the proposed model, the client node sends a request or query to the Name-Node; this request can be a read or a write. The master service Mahout is deployed on Name-Node 1 (active), R-Hadoop on Name-Node 2 (standby), and Splunk on Name-Node 3 (standby). If a user opens a session of R-Hadoop and Mahout at the same time, then they can use both services simultaneously. After the processing, the processed data is sent to HDFS to be stored.
The required HDFS storage is estimated as

S(H) = (A × B × S(D)) / (1 − C)   (1)

S(d) = S(H) / n   (2)

In Equation (1), A symbolizes the compression ratio; A = 1 when no compression is deployed, and it changes if a specific type of compression is applied (for example, Snappy). B signifies the replication factor; commonly, it is 3 for a production cluster. S(D) describes the actual data size when data is injected into Hadoop. C is the transitional data factor; typically, it is 1/3 or 1/4. This is Hadoop's intermediate working storage, deployed for stacking the transitional results of Map jobs, so the planned capacity comes to about 120%, or 1.2 times, the raw footprint; this is due to a fundamental property of HDFS, which needs spare room for the file system. For example, if the total size of the cluster is 1200 TB, it is recommended to utilize only up to 1000 TB. In Equation (2), S(d) expresses the disk space available per node, where n is the number of Data-Nodes.
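A minimal sketch of Equations (1) and (2), applied to one of the 9 GB experimental datasets, follows; the function and variable names are ours.

```python
def total_storage(data_size, compression=1.0, replication=3, intermediate=0.25):
    """Equation (1): total HDFS capacity S(H) needed for raw data of size S(D)."""
    return (compression * replication * data_size) / (1.0 - intermediate)

def disk_per_node(total, nodes):
    """Equation (2): disk space S(d) required per Data-Node."""
    return total / nodes

# A 9 GB dataset with no compression, replication factor 3, and C = 1/4:
s_h = total_storage(9.0)                                # 36 GB of planned capacity
print(round(s_h, 1), round(disk_per_node(s_h, 3), 1))   # spread over 3 Data-Nodes
```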
... Process N3 Exit(0);
Step 3: The Name-Nodes (M1, M2, M3) process the job given by C_M (the user/client machine).
Step 4: if (job request Algo = K-Means) ... // all three algorithms are used for the application purpose
Step 5: if the result is final (the result after the reducer/completion of the algorithm) { the output is stored in HDFS }

The proposed model follows the above algorithm, which works on three equations:

N2 processes the job, if S(D) ≤ 1 GB and D_s belongs to S_p or B_p   (3)
N1 processes the job, if S(D) > 1 GB and D_s belongs to S_p or B_p   (4)
N3 processes the job, if ext(D_s) = ".log"   (5)

The above three equations are accountable for the selection of one managerial (administrative) power from N1, N2, or N3. All three Name-Nodes have shared resources. According to Equation (3), when the size S(D) is 1 GB or less and D_s belongs to S_p or B_p processing, N2's administrative power will start processing the job. Similarly, by Equation (4), if S(D) is more than 1 GB and D_s belongs to S_p or B_p processing, N1's managerial power will start processing the job. Equation (5) applies when the extension of D_s is ".log"; that is, when log analytics is needed, N3's administrative power will start working.
As stated above, the respective resources of all three Name-Nodes are shared with each other. Therefore, the managerial (administrative) power of one Name-Node works as that of the collective three Name-Nodes. After the selection of the administrative power, the master node (Name-Node) will start processing the job provided by C_M. For this purpose, we have selected three different algorithms: K-Means (clustering), Naïve Bayes (classification), and Recommender (collaborative filtering). After finalizing the reducer phase/successful completion of the algorithm, the results are stored in HDFS.
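A minimal sketch of this selection logic, Equations (3)-(5), follows; the field names and the reading of S_p and B_p as stream and batch processing are our assumptions.

```python
# namenode_selection.py -- illustrative routing of a job to N1, N2, or N3
# per Equations (3)-(5). Field names and the S_p/B_p reading are assumptions.

GB = 1024  # data size is handled in MB here

def select_name_node(data_size_mb, processing_type, extension):
    """Return which Name-Node's administrative power handles the job."""
    if extension == ".log":                 # Equation (5): log analytics -> N3 (Splunk)
        return "N3"
    if processing_type in {"S_p", "B_p"}:   # stream or batch processing
        if data_size_mb <= 1 * GB:          # Equation (3): small jobs -> N2 (R-Hadoop)
            return "N2"
        return "N1"                         # Equation (4): large jobs -> N1 (Mahout)
    raise ValueError("unrecognized job description")

# Usage: a 9 GB batch K-Means job goes to N1; a small stream job to N2; logs to N3.
print(select_name_node(9216, "B_p", ".csv"))   # -> N1
print(select_name_node(512, "S_p", ".txt"))    # -> N2
print(select_name_node(2048, "B_p", ".log"))   # -> N3
```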

Contribution
As per the traditional (legacy model) Hadoop infrastructure, there is one Name-Node (master node) coupled with several Data-Nodes (worker nodes), which operate on compatible data sets, i.e., dedicated to a specific algorithm (which expresses the dependency on the platform).
In the present proposed model, three frameworks were established, namely Mahout for machine learning, R for machine learning as well as data analysis, and Splunk for log analysis:

1. For the smooth running of the system, a Maven repository for Mahout is built, through which the machine learning library can easily be used.
2. The core-site.xml and yarn-site.xml configurations are altered as per the requirements of the cluster and tweaked for the HA (High Availability) formation.
3. To make R work on Linux and Hadoop, an R server is built.
4. To keep this combined cluster feasible, multiple repositories are required, including an OS repository, accessed and built with "YUM" on Linux. Worker nodes elect the master based on availability (distance, i.e., the same rack is given preference), available resources (cores and RAM), and the connection, i.e., the SSH (Secure Shell) communication between the master and the worker nodes. In this way, the ensemble of worker nodes chooses the master of the cluster.
5. Serialization is tuned in yarn-site.xml with respect to the Data-Node RPC (Remote Procedure Call, the heartbeat signal that provides communication between the Name-Node and Data-Node and is accountable for allotting the job processing location) and the movement of data directly in the form of input splits from HDFS.
Through the shared HDFS, any kind of data can be kept, which might come from diverse sources or nodes. In other words, the proposed model can manage a whole data lake. A data lake is just like a pond, which holds numerous organisms, stones, gravel, and sand in its own native environment. The basic idea behind the data lake is to have centralized storage for all the enterprise data, from raw data (which has just been generated, with no transformation applied) to fully processed data, which is further utilized to gain insight by applying analytics, machine learning techniques, and visualizations.

Legacy Model
In the legacy model, there is a single node, which is composed of all the daemons (services) running on a single machine. It provides a full environment to run Hadoop-related jobs. There is one JVM, i.e., Java Virtual Machine, which helps to run the MapReduce jobs in the single-node cluster. Whenever a job is submitted by the client (i.e., the user), the job is first analyzed by the Name-Node and sent to the Resource Manager. The Resource Manager coordinates with the node manager and the application master for providing the container and the resources. Thereafter, the job proceeds through the different stages of processing.
As mentioned above, there is one JVM, which means one master node is applied. Furthermore, if the master node goes down, the whole cluster is dissolved, i.e., destroyed. To guard against this single point of failure there is the concept of the secondary Name-Node, which is responsible for keeping the edit logs as well as the file system image (FS image); but still, bringing the master node (Name-Node) back from dead takes downtime, during which the Hadoop admin analyzes the edit logs and FS image to determine the cause of the Name-Node failure. Several families of algorithms, such as frequent itemset mining, classification (Naïve Bayes), clustering (K-Means), and collaborative filtering (recommender algorithms), have been presented and executed separately on the legacy model to demonstrate parallelism and scalability [12][13][14][15][16][17][18][19][20][21][22][23][24][25][26],[45,46].
In the traditional Hadoop infrastructure, one Name-Node (master node) is connected with multiple worker nodes (Data-Nodes) and creates a single HDFS, which executes on a compatible data set, i.e., one assigned to a particular algorithm (which shows the dependency on the platform); this restricts the performance in maintaining the data streaming. Therefore, there is a need for a model that shares a common HDFS, which enables symmetrical streaming access to the data for multiple nodes configured with a variety of tools. The present proposed model reflects this novelty in performance with respect to job execution processing time on different data sets.

Data Description
To perform this experiment, three dummy data sets were taken from reliable sources. The size of each data set is 9 GB (9216 MB), and the descriptions of the data sets are as follows:

Data set 1: The Twenty Newsgroups data is a set of information containing a survey of persons through the website, i.e., what kind of updates they read and what they like [47].

Data set 2: The movie dataset contains numerous files with a customer_id describing who watches the movie, the movie id, and the year of release. The movies are separated as per the votes and scores provided by users. The movie id ranges from 1 to 17,770 [48].

Data set 3: The Spam SMS dataset consists of one message per line with the label and the raw text. The SMS is not always spam, but can be a message between two individuals. This is a completely text-based dataset; it has 6000+ rows of messages and two columns. The spam messages were mined from the website with the help of a web crawler [49].

Job Execution Process in Hadoop Infrastructure
The performance of the proposed and the existing model has also been measured on the basis of response time, execution time (time taken by the MapReduce programming model), and throughput. In this experiment, in the system configuration of the proposed model, each node has 2 GB of memory and two cores for processing, whereas the legacy model's memory and CPU are increased up to 6 GB and four cores, with the same stable 9 GB datasets of three different kinds. Each dataset is run three times on both the legacy and proposed models. Outputs of the experiment are given as mean ± SE, and Student's t-test is applied between the output results of the legacy and proposed models to observe significant differences. The steps of job execution are as follows:
1. New stage: When a user provides an instruction to the Name-Node to execute any job, it comes under the new stage.
2. Submitted stage: When the Name-Node accepts and submits the job to the Resource Manager for further execution, it comes under the submitted stage.
3. Analyzing stage: The Resource Manager does some validation and checks the input path, the output path, and the data.
4. Accepted stage: After the submitting and analyzing stages are completed, the job waits for the Application Master (AM) container to be launched. This stage is just above the execution/running stage.
5. Running stage: When the Application Master (AM) container is assigned, the job starts running in the cluster where the data resides, i.e., on the Data-Nodes. This stage comprises three phases:
   1. Map stage: The map stage is where the data is processed and converted into key-value pairs, and these key-value pairs are then given to the reducer.
   2. Reducer stage: In the reducer stage, the key-value pairs are clubbed together as per their characteristics, such as which values are associated with key 1, and so on.
   3. Committer stage: In the committer stage, the outputs from the reducer processes are clubbed into a single output file, so that instead of having multiple part files the user has a single output file.
6. Finished stage: When the Resource Manager marks the job as finished and the AM container is cleaned up by the node manager; in simple words, the job is successfully completed.

Response Time Comparison between Proposed and Legacy Model
Response time is the total time taken by the new stage, submitted stage, analyzing stage, and accepted stage. Table 2 below gives the response times of the K-Means, Naïve Bayes, and Recommender algorithms using the three different 9 GB datasets on both models. The difference is significant when Student's t-test is applied between the response times of the legacy and proposed models, * (p < 0.01); values are given as mean ± SE.
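For reproducibility, the following is a minimal sketch of how such a comparison can be computed, assuming three timed runs per model; the timing values are placeholders, not the measurements from Table 2, and scipy's independent two-sample t-test stands in for whichever t-test variant the study used.

```python
# response_time_stats.py -- mean ± SE and Student's t-test for model comparison.
# The run times below are placeholders, not the paper's measured values.
import statistics
from scipy import stats

def mean_se(runs):
    """Mean and standard error (SE) over repeated runs of the same job."""
    return statistics.mean(runs), statistics.stdev(runs) / len(runs) ** 0.5

# Response time (s) = new + submitted + analyzing + accepted stages, per run:
legacy_runs = [2.5, 2.4, 2.6]
proposed_runs = [1.1, 1.0, 1.2]

for label, runs in (("legacy", legacy_runs), ("proposed", proposed_runs)):
    m, se = mean_se(runs)
    print(f"{label}: {m:.2f} ± {se:.2f} s")

# Student's t-test between the two models' run times:
t, p = stats.ttest_ind(legacy_runs, proposed_runs)
print(f"t = {t:.2f}, p = {p:.4f}")   # p < 0.01 marks a significant difference
```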
Response time defines the quick behavior towards the job instruction submitted to the Name-Node. It is clear from Figure 2 that the performance of the proposed model is better than that of the legacy model in terms of response.

Throughput (Time Taken by the Single Mapper of Map Function)
The throughput defines a single unit of work done over a specified time period to evaluate the efficiency of the system [50]. For this, we have calculated the execution time (shown in Table 4) of a single Mapper while executing the three different algorithms using the three different data sets on both models. The difference is significant when Student's t-test is applied between the throughputs of the legacy and proposed models, * (p < 0.01); values are given as mean ± SE. Figure 4 shows the throughput measurement, i.e., the time taken by a single Mapper (a single unit of work) of the MapReduce programming model on both the proposed and legacy models, while running the K-Means, Recommender, and Naïve Bayes algorithms using the three data sets of 9 GB.
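As a sketch of this metric under the configuration above, each 9 GB dataset yields 72 mappers of 128 MB, so the per-mapper time can be derived from a job's total map time; the function name and the sample total below are illustrative.

```python
import math

def per_mapper_time(total_map_time_s, data_size_mb, block_size_mb=128):
    """Throughput metric: seconds spent per single 128 MB mapper (unit of work)."""
    mappers = math.ceil(data_size_mb / block_size_mb)   # one mapper per input split
    return total_map_time_s / mappers

# Illustrative: if the map phase of a 9 GB job took 720 s in total,
# each of its 72 mappers averaged 10 s -- the "time per unit of work".
print(round(per_mapper_time(720, 9216), 2))
```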

Discussion
The present paper deals with the performance of the proposed model with respect to the legacy model, measuring the differences in response time, running time (time taken by the MapReduce programming model), and throughput. The validation of the proposed model is the process and set of activities intended to verify that it performs as expected, in line with its design objectives; the validation also identifies the potential limitations and assumptions and assesses their possible impact. The proposed model's system configuration contains three Name-Nodes, three Data-Nodes, and one client node, in which each node has 2 GB of memory and two cores for processing, whereas in the legacy model, memory and CPU are increased up to 6 GB and four cores. In this way, both models process the same stable 9 GB datasets of three different kinds. In the present work it was expected and assumed that the proposed model should deliver better response time, running time, and throughput.
The response time on data set 1 for the three algorithms, i.e., K-Means, Recommender, and Naïve Bayes, reduced from 2.5 to 1.1, 2.1 to 1.0, and 2.4 to 1.3 s, respectively, between the legacy and proposed models; the same is noticed for data sets 2 and 3. This shows that the proposed model gives a quicker response to the job process compared to the legacy model. The MapReduce (running time) completion time on data set 1 for K-Means, Recommender, and Naïve Bayes reduced from 0.33 to 0.16, 0.37 to 0.19, and 0.34 to 0.16 h, respectively. When data sets 2 and 3 are run, the same trend in MapReduce completion time is found. This shows that the proposed model completed the Mapper, Reducer, and Committer phases of the MapReduce programming model in less time than the legacy model. The throughput on data set 1 for K-Means, Recommender, and Naïve Bayes reduced from 17.04 to 9.16, 17.5 to 10, and 16.67 to 9.17 s, respectively. A similar pattern is observed for data sets 2 and 3. This shows that the proposed model took less time to complete a single Mapper (128 MB) than the legacy model. The improved performance of the proposed model establishes the hypothesis that our model overcomes the resource limitations of the legacy model. The present proposed model's configuration and performance are given for the first time, so comparable published findings are lacking. Table 2 gives the comparative response time of both models, measured as the total time taken by the four initial stages of job execution, i.e., the new, submitted, analyzing, and accepted stages. It is evident from Table 2 and Figure 2 that the response time of the three algorithms (K-Means, Naïve Bayes, and Recommender) using the three different datasets on the proposed model is significantly (p < 0.01) lower than on the legacy model. This denotes the quick response of the proposed model (infrastructure).
The experiment on execution time/running stage (time taken by the MapReduce programming model) is shown in Table 3, which gives the total time taken by the MapReduce programming model on both the proposed and the existing models. Table 3 clearly indicates that the three algorithms (K-Means, Naïve Bayes, and Recommender) on the three different datasets completed their jobs faster on the proposed model than on the existing model; in fact, the proposed model took significantly (p < 0.01) less time than the legacy model, as shown in Figure 3. Table 4 gives the throughputs of both models using the three algorithms (K-Means, Naïve Bayes, and Recommender) on the three different datasets. According to the definition of throughput, a single unit of work done in a specified time period evaluates the efficiency of the system, so in this experiment the time taken to complete a job by a single mapper of the Map function is calculated. It is measured with the help of a data set split into chunks the size of a single mapper, which is 128 MB in the case of YARN (Yet Another Resource Negotiator). The present study demonstrates that the proposed model is significantly (p < 0.01) more efficient than the legacy model, as shown in Figure 4. Therefore, the above findings clearly indicate that the proposed model is better than the legacy model in terms of response time of the system, execution time/running stage (time taken by the MapReduce programming model), and throughput. In addition, the proposed model is highly efficient at running any kind of algorithm on any kind of data of different sizes simultaneously.

Conclusions
Conclusively, the present paper demonstrates a time-efficient proposed model that shares a common HDFS through the integration of three Name-Nodes (hosting Mahout, R-Hadoop, and Splunk), three Data-Nodes, and one client node, which can communicate and share its business demands with all three Name-Nodes. Time optimization of our model is established by the performance evaluation of response time, execution time, and throughput using three different algorithms (K-Means clustering, Naïve Bayes, and the Recommender system) on three different data sets. From the definitions and the experiments in the present study, it can be concluded that the proposed model is highly efficient compared to the legacy model. In addition, through the common HDFS, all the Data-Nodes and Name-Nodes can access any data residing in HDFS, whether it is processed or used by any other node. Any algorithm on different types of data can be run efficiently with the help of our proposed model for big data analytics. In the future, the performance of the model can be explored for a larger number of master nodes as well as data nodes.

Conflicts of Interest: The authors declare no conflict of interest.