SHIYF: A Secured and High-Integrity YARN Framework

Abstract: Cloud computing is becoming a powerful parallel data processing method, and it can be adopted by many network service providers to build a service framework. Although cloud computing is able to efficiently process a large amount of data, it can be attacked easily due to its massively distributed cluster nodes. In this paper, we propose a secure and high-integrity YARN framework (SHIYF), which establishes a close relationship between speculative execution and the security of Yet Another Resource Negotiator (YARN, MapReduce 2.0). SHIYF computes and compares the MD5 hashes of the intermediate and final results in the MapReduce process by launching speculative executions in a certain ratio, which makes it possible to find actual and potentially malicious nodes in the Hadoop cluster. The prototype of SHIYF is implemented based on Hadoop 2.8.0. Theoretical derivations and experiments show that SHIYF not only guarantees the security and high integrity of the MapReduce process but also successfully locates the malicious nodes and the potentially malicious ones in Hadoop, while increasing overhead only slightly. Furthermore, the malicious node detection ratio is more than 87%.


Introduction
With the rapid development of hardware, software, and high-speed networks, many cloud service providers (e.g., Google and Amazon) are deploying a growing number of cloud computing (CC) [1,2] platforms around the world, as shown in Figure 1. However, many organizations and customers remain reluctant to accept CC because of security issues [3,4]. Therefore, solving the relevant security problems is of considerable significance for the long-term development of CC [5].
Some safety precautions are already attracting attention [6]. For instance, Gartner et al. identified seven security issues of CC that must be solved [7]. Grobauer et al. discussed the security vulnerabilities of the cloud platform [8]. Jansen et al. proposed guidelines on privacy in public CC [9]. Furthermore, security guidance for CC has been published by the Cloud Security Alliance and IEEE [10].
Hadoop [11] is considered the most widely used CC platform [12], and it represents the state-of-the-art efficient framework for processing vast amounts of distributed data [13]. However, most researchers are still focusing on the performance and application of MapReduce rather than its security. For example, Dawei Jiang et al. identified five design factors that affect the performance of Hadoop [14]. Yanpei Chen et al. built the case for going beyond benchmarks for MapReduce performance evaluations [15]. Rares Vernica et al. studied how set similarity joins can be efficiently performed in parallel using the popular MapReduce framework [16]. A few studies have been conducted on the security of MapReduce, such as one study that focused on Airavat, which is a MapReduce-based system that provides strong security and privacy guarantees for distributed computations on sensitive data [17]. Reference [18] introduces a new privacy-preserving encoding with "somewhat homomorphic" properties for MapReduce. In addition, some whitepapers about security designs of Hadoop and MapReduce have been published [19][20][21].
However, there are still several security breaches in Hadoop 2.0 as follows.
• Service identity forging. Because there is no service certification, any malicious node can masquerade as a secure node and join the Hadoop cluster to get/calculate data as long as it knows the ResourceManager (RM) address.
• User identity forging. Because there is no user authentication, any malicious client can fake a user identity to get Hadoop Distributed File System (HDFS) data or perform job management.
• Lack of authorization mechanism. A client can do anything; for example, a job submitted by user A can be killed by user B at will.
• Data communications are not encrypted. They are vulnerable to eavesdropping.
Yet Another Resource Negotiator (YARN, also known as MapReduce 2.0/MRv2) is one of the key features in the second-generation Hadoop and provides resource management and scheduling for large-scale MapReduce environments [22]. Research on the performance or security of YARN remains in its infancy. For example, Li Ping et al. proposed an energy-efficient service level agreement-aware scheduling scheme that allocates an appropriate amount of resources to MapReduce applications with YARN architecture [23]. Reference [24] presented a new methodology for determining desired hardware and software configuration parameters for MapReduce 2.0 applications; thus, the representative applications achieved up to 5× performance improvement. An energy-aware fair scheduling framework based on YARN (denoted as EFS) is proposed by Shao Yanling et al., which can effectively reduce energy consumption whilst meeting the required service level agreements (SLAs) [25]. Diarchy increases the reliability of YARN based on the sharing and backup of responsibilities between two masters working as peers [26]. A SECapacity scheduler was proposed for the requirement of isolating the user's job and data security [27]. Reference [28] proposed a novel partitioner for improving YARN performance (NPIY) based on Hadoop 2.6.0, which adopts an innovative parallel sampling method to distribute intermediate data.
In Hadoop, speculative execution is equivalent to replication (also known as double-check), which sacrifices space for time. Replication-based techniques mainly rely on redundant computation resources to execute duplicated individual tasks for verifying the consistency of results [29]. W. Wei et al. proposed a service integrity assurance framework for MapReduce (SecureMR) based on Hadoop 1.0 [30]. SecureMR provides a decentralized replication-based integrity verification scheme for ensuring the integrity of MapReduce in open systems. Although MapReduce is now a programming model for data processing on YARN, executing duplicated tasks (using speculative execution in Hadoop 2.0) is still an effective way to counter service identity forging and to identify the malicious nodes in a Hadoop cluster.
In this paper, we focus on improving the security of YARN. A secure and high-integrity YARN framework (SHIYF) is proposed by extending Hadoop 2.8.0. Sacrificing space for security is the key idea of SHIYF. Extensive theoretical derivations and experiments are performed to prove the framework's validity, security, and malicious node detection efficiency. The main contributions are summarized as follows.
MRv2 job and achieves a malicious node detection ratio of more than 90%.
5. Experiment results show that SHIYF can ensure the security of MRv2 services while increasing overhead slightly. Moreover, the malicious node detection ratio is between 87% and 93.3%.
6. This finding is in line with the expectation of theoretical derivation.
The remainder of the paper is organized as follows. In the next section, we introduce the SHIYF design and implementation in detail. Section 3 provides the theoretical derivations. Section 4 reports the experimental results of SHIYF and compares them with the theoretical results. Section 5 contains the conclusions and prospects for future work.

SHIYF Design
Because SHIYF verified the validity of the intermediate and final results generated by Map and Reduce in the programming model, it launched more TaskAttempts than YARN, as speculative executions in a certain ratio. These additional TaskAttempts executed the same tasks and computed the MD5 Message-Digest Algorithm (MD5) hashes of the results. The programming model of SHIYF is shown in Figure 2.
MRAppMaster is the ApplicationMaster implementation of MapReduce, which allows MapReduce to be run directly on YARN. Its main function is to manage the life cycle of the job, including:
• Job creation, initialization, startup, and so on.
• Applying to RM for resources and reallocating resources.
• Container startup and release.
• Monitoring the operation status of the job.
• Job recovery.
In the runtime environment of SHIYF, MRAppMaster provided a set of security mechanisms, including secured task duplication and assignment, intermediate result checking, and final result verification. MRAppMaster could apply for two containers to execute the same TaskAttempt for a task through speculative execution. When MRAppMaster received two MD5 hashes from different containers, it compared them. If they were the same, then MRAppMaster considered that the task had been completed and the result was correct. Otherwise, it applied for a third container to execute the same TaskAttempt again to verify the result. Finally, if two of the results had the same MD5 hash, MRAppMaster accepted that result as correct; if not, it judged that the task had failed. The resulting framework is shown in Figure 3. For simplicity and clarity, each MRAppMaster in the figure controls only one task, so three different jobs are shown.
Figure 3. The secured and high-integrity Yet Another Resource Negotiator (YARN) framework.
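The two-then-three container check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SHIYF code: `run_attempt` is a hypothetical callback standing in for launching a TaskAttempt in a container, and `verify_task` and `md5_hex` are names introduced here.

```python
import hashlib
from typing import Callable, Optional, Sequence


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a task result as a hex string."""
    return hashlib.md5(data).hexdigest()


def verify_task(run_attempt: Callable[[str], bytes],
                containers: Sequence[str]) -> Optional[bytes]:
    """Run the same task on two containers and compare the MD5 hashes.

    run_attempt(container) is a hypothetical callback that executes the
    task in the named container and returns the raw result bytes.
    Returns the accepted result, or None if the task is judged failed.
    """
    if len(containers) < 3:
        raise ValueError("three candidate containers are required")

    first = run_attempt(containers[0])
    second = run_attempt(containers[1])
    if md5_hex(first) == md5_hex(second):
        return first  # hashes agree: the result is accepted

    # Hashes differ: apply for a third container and rerun the attempt.
    third = run_attempt(containers[2])
    for earlier in (first, second):
        if md5_hex(earlier) == md5_hex(third):
            return third  # two of the three results agree
    return None  # no two hashes match: the task failed
```

The third attempt is launched only on a mismatch, so an honest cluster pays for two executions of a checked task rather than three.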

SHIYF Implementation
In YARN, RMApp is a data structure that preserves an application life cycle in RM. Its realization class is RMAppImpl. This class maintains an application state machine that records several application states and state-driven events. The finite-state machine (FSM) of RMAppImpl is shown in Figure 4. When MRAppMaster is launched, the application will enter into the core state "RUNNING." Every application may run several times. The transitions of states are determined by the return values of MRAppMaster. RMApp judges that an application has failed when all RMAppAttempts failed. Therefore, MRAppMaster is the most important module in SHIYF.

In addition, because YARN uses an asynchronous programming model based on an event-driven mechanism, every component is an event handler. MRAppMaster establishes relations with other components through events and assigns all types of events to the corresponding schedulers. Figure 5 shows the components and the services of MRAppMaster. ContainerAllocator (CA), Speculator, Job, Task, and TaskAttempt must be redesigned to implement SHIYF.

SHIYF ContainerAllocator
CA is a resource scheduler. It divides the tasks requesting resources into three categories, namely Failed Map, Reduce, and Map, from high to low priority. The workflow of SHIYF CA is as follows:
Step 1. To add the speculative tasks in a certain ratio, the chosen Maps/Reduces and their duplicate resource applications were sent to RM at the same time.
Step 2. If the scheduling conditions of Reduces were met, then CA gave priority to them.
Step 3. CA was allowed to occupy double resources simultaneously in SHIYF.
Step 4. If a task had failed before, CA applied for resources for it again.
Step 5. If a task ran too slowly, CA applied for extra resources to start its speculative task.
Step 6. CA withdrew all resource distributions to a node when it failed too many times. In SHIYF, if a node failed more than five times, then CA withdrew all its resource applications and judged it to be a malicious node.
The CA in SHIYF is shown in Figure 6.
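As a rough illustration of the priority order and of Step 6, the following Python sketch orders requests by category and flags a node once it has failed more than five times. The class and function names (`NodeFailureTracker`, `order_requests`) and the category strings are hypothetical and are not taken from the Hadoop or SHIYF code base.

```python
from collections import Counter

# Category priorities as described above: a lower number is served first.
PRIORITY = {"FAILED_MAP": 0, "REDUCE": 1, "MAP": 2}
MAX_NODE_FAILURES = 5  # threshold stated in Step 6


class NodeFailureTracker:
    """Counts task failures per node and flags nodes over the threshold."""

    def __init__(self, max_failures=MAX_NODE_FAILURES):
        self.max_failures = max_failures
        self.failures = Counter()
        self.malicious = set()

    def record_failure(self, hostname):
        """Record one failure; return True once the node is judged malicious."""
        self.failures[hostname] += 1
        if self.failures[hostname] > self.max_failures:
            self.malicious.add(hostname)
        return hostname in self.malicious

    def usable(self, hostnames):
        """Filter out nodes whose resource applications were withdrawn."""
        return [h for h in hostnames if h not in self.malicious]


def order_requests(requests):
    """Sort (category, task_id) pairs by category priority (stable sort)."""
    return sorted(requests, key=lambda request: PRIORITY[request[0]])
```

A node is flagged on its sixth failure, matching "more than five times"; the fifth failure alone does not trigger withdrawal.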

SHIYF Speculator
In Hadoop, speculative execution sacrifices space for time. However, sacrificing space for security is the key idea of SHIYF. We designed it to conduct speculative execution repeatedly for MD5 computation and hash comparison. The corresponding event handler of speculative execution is referred to as a speculator.
Prior to launching a new speculator in SHIYF, the redesigned MRAppMaster must check whether the current task conforms to the following three conditions:
1. Whether the current task already had a backup task. Every task could have two speculative tasks, and three at most.
2. The ratio of completed tasks was not less than MINIMUM_COMPLETE_PROPORTION_TO_SPECULATE (5%). Only then did the Speculator have sufficient historical task information to estimate estimatedReplacementEndTime.
3. DefaultSpeculator could launch speculative execution with a certain probability without calculating the speculationValue.
Because SHIYF could achieve a desired malicious node detection ratio once the Speculative Execution Ratio reached 30% (see Section 3), we redesigned three parameters in SHIYF as follows.
• MINIMUM_ALLOWED_SPECULATIVE_TASKS = 10. It represents the minimum number of total speculative tasks that are allowed for a job.
• PROPORTION_TOTAL_TASKS_SPECULATABLE = 0.35. It denotes that the highest percentage of speculative tasks to the total tasks is 35%.
• PROPORTION_RUNNING_TASKS_SPECULATABLE = 0.3. It indicates that the highest percentage of speculative tasks to all running tasks is 30%.
Therefore, the number of speculative tasks that are allowed in a job (numberAllowedSpeculativeTasks) is the maximum of three values determined by these parameters.
Limiting the number of speculative tasks in a job in this way meets the requirements of SHIYF and effectively prevents the waste of resources caused by a large number of tasks launching speculative tasks at the same time.
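The three values themselves are not listed in the text above. The sketch below assumes they are the fixed minimum, the allowed share of the total tasks, and the allowed share of the running tasks, which is what the three parameter names suggest; the function name and its arguments are illustrative only.

```python
# Parameter values as given in the text.
MINIMUM_ALLOWED_SPECULATIVE_TASKS = 10
PROPORTION_TOTAL_TASKS_SPECULATABLE = 0.35
PROPORTION_RUNNING_TASKS_SPECULATABLE = 0.3


def number_allowed_speculative_tasks(total_tasks: int, running_tasks: int) -> int:
    """Upper limit on speculative tasks for one job.

    Assumption: the limit is the maximum of the fixed minimum, 35% of the
    total tasks, and 30% of the currently running tasks, truncated to an
    integer.
    """
    return int(max(
        MINIMUM_ALLOWED_SPECULATIVE_TASKS,
        PROPORTION_TOTAL_TASKS_SPECULATABLE * total_tasks,
        PROPORTION_RUNNING_TASKS_SPECULATABLE * running_tasks,
    ))
```

Under this reading, a small job with 20 tasks is still allowed 10 speculative tasks because the fixed minimum dominates, whereas a job with 100 tasks is limited by the 35% share of total tasks.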

SHIYF Security Control
To ensure the security of YARN, SHIYF should compute and compare the MD5 hashes of intermediate and final results. This process involves three services, namely, Job, Task, and TaskAttempt. Their communication processes in SHIYF are shown in Figure 7. Two types of tasks are used; one is the normal task, and the other is chosen to check the MD5 hashes of results.
To ensure the validity of results and locate the malicious nodes, Job, Task, and TaskAttempt have the following additional functions:
• In Job, the hostnames of nodes that failed to execute tasks were recorded and written to HDFS logs. If a node failed more than five times, then SHIYF considered it a malicious node.
• If two TaskAttempts processed the same data but returned different hashes, then Task launched another speculative TaskAttempt to verify the result again. A node that returned a wrong hash once was recorded as a potential malicious node. If the hash comparison failed twice, then Task returned "JOB_TASK_UNCOMPLETED" to Job and restarted. Moreover, all three nodes in this task were considered potential malicious nodes.
• A TaskAttempt with speculative execution should compute the MD5 hash of the result and transmit it to Task; a normal one does not. They are highlighted in red in Figure 7.
We could find the malicious nodes and the potential ones by reviewing the logs on HDFS.
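The bookkeeping in the second bullet can be made concrete with a small Python sketch. It only illustrates the decision rule as described; the function name and the status strings "SUCCEEDED" and "NEED_THIRD_ATTEMPT" are hypothetical, while "JOB_TASK_UNCOMPLETED" is the value named in the text.

```python
from collections import Counter


def classify_attempts(attempts):
    """Decide a task's status from its attempts' hashes.

    attempts is a list of (hostname, md5_hex) pairs for the same task.
    Returns (status, suspects), where suspects is the set of hostnames
    to record as potential malicious nodes.
    """
    hashes = [md5 for _, md5 in attempts]

    if len(attempts) == 2:
        if hashes[0] == hashes[1]:
            return "SUCCEEDED", set()
        # One of the two is wrong, but it is not yet known which one.
        return "NEED_THIRD_ATTEMPT", set()

    if len(attempts) != 3:
        raise ValueError("expected two or three attempts")

    winner, votes = Counter(hashes).most_common(1)[0]
    if votes >= 2:
        # The node whose hash disagrees with the majority is a suspect.
        return "SUCCEEDED", {host for host, md5 in attempts if md5 != winner}

    # The comparison failed twice: all three nodes become suspects.
    return "JOB_TASK_UNCOMPLETED", {host for host, _ in attempts}
```

In a fuller version, the returned suspects would be appended to a log together with the task identifier, so that repeated offenders can be found by reviewing the log afterwards.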

SHIYF State Management
Many components and services were used in SHIYF. The FSMs of TaskAttempt, Task, and Job should be changed accordingly to achieve SHIYF security control.
First, the FSM of SHIYF TaskAttempt is shown in Figure 8. "TA_MD5_COMPUTE" was added to the state "RUNNING" to compute the MD5 hashes of the intermediate and final results. However, only tasks selected to check the result validity executed the speculative tasks and MD5 hash computations. The verification process of TaskAttempt in SHIYF is as follows:
Step 1. TaskAttempt judged whether the task was to be checked based on the signature added by Job.
Step 2. TaskAttempt saved the related messages of the container runtime, such as LaunchTime, trackerName, httpPort, and the MD5 hash of the result.
Step 3. TaskAttempt then renewed the counter messages and informed the history server and the speculator service.
Step 4. TaskAttempt computed the MD5 hashes, transmitted them to Task, and informed it that this attempt was successful.
In addition, three relevant services, "TA_UPDATE," "TA_UPDATE/StatusUpdater," and "TA_CONTAINER_COMPLETED," need to be changed accordingly to control and trigger the state transition, as emphasized in Figure A1.
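A minimal Python sketch of Steps 1, 2, and 4 is given below, assuming the attempt's output is available as a byte stream. The field names LaunchTime, trackerName, and httpPort come from the text; the function names, the `checked` flag, and the dictionary layout are assumptions made for illustration.

```python
import hashlib
import io
import time


def md5_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large outputs need not fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def finish_attempt(output_stream, tracker_name, http_port, checked,
                   launch_time=None):
    """Build the report an attempt would hand to its Task.

    checked mirrors the signature added by Job (Step 1): only attempts of
    tasks selected for checking compute an MD5 hash.
    """
    report = {
        "LaunchTime": time.time() if launch_time is None else launch_time,
        "trackerName": tracker_name,
        "httpPort": http_port,
        "md5": None,
    }
    if checked:
        report["md5"] = md5_of_stream(output_stream)
    return report
```

Normal attempts skip the hash entirely, which is consistent with the text's point that only the selected tasks pay the additional cost.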
Second, the FSM of SHIYF Task is shown in Figure A2. Three important improvements are as follows:
1. To check some task results, SHIYF needed TaskAttempts and their speculative executions to run in parallel until they completed and returned MD5 hashes. Therefore, Task in SHIYF should allow two or three speculative Attempts to be retained at the same time; namely, Task will not kill other corresponding Attempts when it receives "T_ATTEMPT_COMMIT_PENDING," which records that the Attempt is running.
2. When Task received "T_ADD_SPEC_ATTEMPT," it created a new speculative Attempt to run the same task. All the tasks chosen for checking and their speculative executions were given the sign "Extra_SETask" as the criterion for launching MD5 computation in TaskAttempt.
3. When a TaskAttempt runs successfully, Task in YARN will receive "T_ATTEMPT_SUCCEEDED" and kill other Attempts. However, SHIYF needed to compare the MD5 hashes of the two identical TaskAttempts to ensure the validity of the results. Therefore, even if an Attempt has been completed and its MD5 hash has been returned, Task should still wait for the other speculative TaskAttempts to finish.
Thus, several other relevant improvements were made as follows.
• An event "T_ATTEMPT_MD5_COMPARE" was added in "RUNNING." This event triggered MD5 hash comparison.
• If the first comparison failed but the second or the third comparison succeeded, Task added a "SUCCEED_FALSE" mark to indicate that the Attempt executed successfully but returned a wrong MD5 hash once. At the same time, Task recorded the hostnames of these TaskAttempt machines as evidence of potential malicious nodes.
• "TA_ATTEMPT_SUCCEEDED," "T_ADD_SPEC_ATTEMPT," and "T_ATTEMPT_COMMIT_PENDING" in "SUCCEEDED" must be changed accordingly to control and trigger the state transition.
Finally, the FSM of the SHIYF Job is shown in Figure A3. When the job entered the "RUNNING" state, event handling was passed to the tasks until they returned trigger events (e.g., JOB_TASK_ATTEMPT_COMPLETED, JOB_MAP_TASK_RESCHEDULED, JOB_TASK_ATTEMPT_FETCH_FAILURE, JOB_TASK_COMPLETED, and JOB_COMPLETED). The trigger events and the corresponding states marked in red in Figure A3 must be redesigned. For instance, when SHIYF Job received the "JOB_TASK_COMPLETED" trigger event, it not only calculated the numbers of completed tasks, failed tasks, and killed tasks, but also recorded the hostnames of the malicious nodes and the potential ones.
Therefore, the corresponding SHIYF ResourceManager and NodeManager implementations are shown in Figures A4 and A5.

Theoretical Arithmetic
Although the Map speculative task and the Reduce speculative task are slightly different, their principles are the same. Thus, we use the Map task replication as an example to show the theoretical arithmetic.
To compare differences and similarities easily without losing generality, we set every MRv2 job to process the same size of data. Thus, the total number of blocks was fixed in every experiment; moreover, the data of every block were different. Every Map task processed only one block, which implied that the number of copied blocks was equal to the number of replicated Map tasks. We assumed the number of blocks (Map tasks) was $b$. A container is the abstraction of a resource set in YARN. It is allocated by RM and supervised by NodeManager (NM). Every task must be executed in a container; thus, the number of containers was also $b$.
Despite the security benefit, replicating all the MRv2 tasks by speculative executions is not practical, because doing so consumes considerable resources and time. We introduced the Execution Ratio ($E_r$) to indicate that $b \times E_r$ blocks would be duplicated. Letting $N$ be the number of Map speculative tasks, we have $N = b \times E_r$.
If an MRv2 job involves one MRAppMaster and $n$ containers, then $m$ containers might be malicious, with $m < n$. The aims of SHIYF are to ensure the integrity of MRv2 results and find the malicious nodes. Theoretical arithmetic shows the relationship between the Detection Ratio ($D_{ratio}$) and the above parameters as follows:
Step 1. $P_{am}$ denotes the probability that a malicious Map task is present in $b$ Map tasks, and $P_{nm}$ is the probability that any Map task is not malicious.
Step 2. If $N$ duplicated blocks are in an MRv2 job, $P_{Ns}$ denotes the probability that all $N$ Map tasks are secure.
Step 3. In case a Map task is not executed in a secure container, $P_{Nam}$ denotes the probability that at least one malicious Map speculative task occurs among the $N$.
Step 4. We suppose that malicious nodes execute vicious actions with probability $P$. $P_{mea}$ denotes the probability of the malicious containers (nodes) executing the vicious actions.
Step 5. $P_{mna}$ denotes the probability that the malicious nodes do not conduct the baleful behaviors.
Step 6. The variable $t$ represents the number of jobs executed by MRv2. $P_{mct}$ denotes the probability that the malicious nodes perform the tasks correctly in all $t$ MRv2 jobs.
Step 7. $P_{mat}$ denotes the probability that the malicious nodes expose themselves in $t$ MRv2 jobs.
Step 8. The aim of SHIYF is to find all the malicious nodes in YARN. Therefore, the detection ratio $D_{ratio}$ is obtained from the above derivations.
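The expressions for the individual probabilities are not reproduced in the text above. The following Python sketch is therefore only one possible reading of Steps 1 to 8, assuming that tasks and jobs are independent and that each "not" or "at least one" probability is the complement of its counterpart. In particular, P_am is taken as an input because its formula is not given here, and equating the returned value with D_ratio is an assumption rather than the authors' stated formula.

```python
def detection_ratio(p_am: float, b: int, e_r: float, p: float, t: int) -> float:
    """One possible reading of Steps 1 to 8 under independence assumptions.

    p_am: probability that a Map task is malicious (input, not derived)
    b:    number of blocks (Map tasks)
    e_r:  execution ratio E_r, the fraction of blocks that are duplicated
    p:    probability P that a malicious node actually misbehaves
    t:    number of MRv2 jobs
    """
    for name, value in (("p_am", p_am), ("e_r", e_r), ("p", p)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")

    n_spec = b * e_r          # N = b * E_r duplicated Map tasks
    p_nm = 1.0 - p_am         # Step 1: a Map task is not malicious
    p_ns = p_nm ** n_spec     # Step 2: all N duplicated tasks are secure
    p_nam = 1.0 - p_ns        # Step 3: at least one duplicated task is malicious
    p_mea = p * p_nam         # Step 4: a malicious node acts and is duplicated
    p_mna = 1.0 - p_mea       # Step 5: no malicious action is exposed in a job
    p_mct = p_mna ** t        # Step 6: nothing is exposed in any of t jobs
    p_mat = 1.0 - p_mct       # Step 7: exposure in at least one of t jobs
    return p_mat              # Step 8 (assumed): taken as the detection ratio
```

Under these assumptions the value grows with E_r, P, and t, which agrees qualitatively with the trends reported in the theoretical results; the sketch makes no claim about reproducing the reported numbers.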

Theoretical Results
We examined the effects of $b$, $P$, $t$, and $E_r$ on the detection ratio $D_{ratio}$ based on the theoretical derivation presented in Section 3.1. Figure 8 shows the change of $D_{ratio}$ against the execution ratio $E_r$ and the number of blocks $b$, where $t = 40$ and $P = 0.2$. $D_{ratio}$ increases with $E_r$ and decreases slightly as $b$ increases.
Figure 9 shows the change of $D_{ratio}$ against $E_r$ and the malicious behavior probability $P$, where $b = 20$ and $t = 10$. Evidently, $D_{ratio}$ increases with $P$: for a given $E_r$, the more often malicious nodes misbehave, the more effectively SHIYF detects them.
On the basis of the theoretical derivation results, we can draw the following conclusions:
1. The detection ratio $D_{ratio}$ increases with the execution ratio $E_r$, the number of jobs $t$, and the malicious action probability $P$.
2. The number of blocks $b$ has a minimal impact on $D_{ratio}$.
3. As long as the number of jobs $t$ is equal to or greater than 25, we can set $E_r$ at a low level (≤30%) and still achieve a desired $D_{ratio}$ (≥85%) when $P \geq 0.2$. Moreover, the larger $P$ is, the better $D_{ratio}$ is.
4. Furthermore, if Map speculative tasks and Reduce speculative tasks are combined, we can reasonably expect $D_{ratio}$ to exceed 90%.
In conclusion, theoretical derivation indicates that SHIYF is effective for finding malicious behaviors.

SHIYF Experiments
In this section, we evaluate the security, integrity, and performance of SHIYF by conducting three benchmark experiments: WordCount, TestDFSIO, and MRBench.
We deployed the entire SHIYF cluster with an RM node and six NM nodes. The RM machine was equipped with one quad-core 3.9 GHz Intel Xeon E3-1280 V6 CPU, 16 GB memory, one Intel DC S3710 800 GB SSD, and 1000M NIC. Six NM machines were equipped with one quad-core 3.0 GHz Intel Core i5-7400 CPU, 8 GB memory, one 500 GB SATA II disk, and 1000M NIC. All machines had the same software configurations, including Ubuntu Server 16.04 LTS (64-bit), JDK 1.8.0, and Hadoop 2.8.0.
Considering the efficiency of the Hadoop cluster, the local data, and the objective of the SHIYF experiments, we first configured and optimized the Hadoop cluster as follows:
• The file replication number of HDFS (dfs.replication) was set to 2 because the experiments were executed in a local rack. The minimum size of each file chunk was set to 256 MB to facilitate the processing of large files. To avoid a large number of data copies from remote machines, the size of the split was set equal to the size of the block. Each task processes one split.
• Given that each of the six NM machines was equipped with one quad-core CPU, the value of "mapred.tasktracker.tasks.maximum" was set to 4. The number of reduce tasks equaled 1.75 × (the number of NMs × mapred.tasktracker.tasks.maximum), namely, 42. The faster NMs that finish their first round of reduce tasks then launch the second round of reduces immediately, which improves load balancing.
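The reduce-task count quoted above follows directly from the cluster size. As a quick check, the sketch below recomputes it; the function name is illustrative and is not part of Hadoop.

```python
def reduce_task_count(num_nms, slots_per_nm, factor=1.75):
    """Number of reduce tasks: factor x (NMs x task slots per NM)."""
    return int(factor * num_nms * slots_per_nm)


num_nms, slots_per_nm = 6, 4
reduces = reduce_task_count(num_nms, slots_per_nm)
first_round = num_nms * slots_per_nm      # reduces that can start at once
second_round = reduces - first_round      # picked up by NMs that finish early

print(reduces, first_round, second_round)
```

With a factor above 1, not every reduce fits in the first round, so the remaining ones are taken by whichever NMs free a slot first.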
In addition, we defined three experiment scenarios as follows:
1. In Hadoop, speculative execution was enabled by default.
2. In SHIYF, 30% of the Map and Reduce tasks were selected randomly to check the validity of results; thus, they execute the speculative tasks and MD5 hash computations.
3. In SHIYF, two NMs execute malicious behaviors and return wrong MD5 hashes with 20% probability, which is equivalent to 7%-33.3% malicious nodes in the Hadoop cluster.
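The second and third scenarios rest on two operations: sampling a fixed share of tasks for duplication and comparing MD5 digests of their results. The following is a minimal sketch of that idea in Python, not the SHIYF implementation (which is built into Hadoop). All function names are hypothetical, and the use of a third attempt as a tie-breaker is an assumption of this sketch.

```python
import hashlib
import random


def md5_hex(data: bytes) -> str:
    """MD5 digest of a task result, as a hex string."""
    return hashlib.md5(data).hexdigest()


def select_for_verification(task_ids, ratio, rng):
    """Pick round(ratio * len(task_ids)) tasks uniformly at random."""
    ids = list(task_ids)
    return set(rng.sample(ids, round(len(ids) * ratio)))


def verify(original: bytes, speculative: bytes, tiebreak: bytes = None) -> str:
    """Compare two attempts; on mismatch, consult a third attempt if given."""
    h1, h2 = md5_hex(original), md5_hex(speculative)
    if h1 == h2:
        return "consistent"
    if tiebreak is None:
        return "mismatch"
    h3 = md5_hex(tiebreak)
    if h3 == h1:
        return "speculative attempt suspect"
    if h3 == h2:
        return "original attempt suspect"
    return "task failed"   # all three digests differ


selected = select_for_verification(range(240), 0.30, random.Random(0))
print(len(selected))
print(verify(b"word\t3", b"word\t3"))
print(verify(b"word\t3", b"word\t9", tiebreak=b"word\t3"))
```

Comparing digests rather than full outputs keeps the check cheap, since only a fixed-length hash has to be exchanged for each verified task.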

Execution Results of SHIYF
In the WordCount benchmark, we chose various test files and compared the time cost in the three scenarios. To represent big data workloads, all the test files were larger than 256 MB and thus met the minimum file block setting. The results are the average values of 25 WordCount experiments, a number chosen based on Section 3. The corresponding histogram is shown in Figure 11.
We suppose that $S_i$ is the size of one file and $a$ is the number of such files. Let $N_b$ denote the number of blocks; then

$$N_b = \frac{\text{total size of files}}{\text{mapred.min.split.size}} = \frac{\sum_{i=1}^{n} a S_i}{256\,\mathrm{M}}, \quad a = 1, 2, \ldots, n. \qquad (11)$$
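As a worked instance of Formula (11), the sketch below counts blocks for a few file layouts that total 60 G. It assumes 1 G = 1024 M and rounds up when the total is not a multiple of the split size; the function name is illustrative.

```python
GB_IN_MB = 1024  # ASSUMPTION: 1 G = 1024 M


def block_count(file_sizes_mb, split_size_mb=256):
    """Total input size divided by the split size, rounded up."""
    total = sum(file_sizes_mb)
    return -(-total // split_size_mb)   # ceiling division on integers


layouts = {
    "60 x 1 G": [1 * GB_IN_MB] * 60,
    "10 x 6 G": [6 * GB_IN_MB] * 10,
    "1 x 60 G": [60 * GB_IN_MB],
}
for name, sizes in layouts.items():
    print(name, block_count(sizes))
```

Each layout yields the same number of blocks, which is why the number of Map tasks stays fixed while the number of input paths varies.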
Therefore, we could obtain three conclusions:
1. In the original YARN framework, although the number of input paths of "60 × 1 G" is 60 times that of "60 G," the time cost increases only slightly with the total number of input paths when the number of blocks equals 240 according to Formula (11).

When two malicious NMs are present in SHIYF, the probability of Map/Reduce tasks being assigned to them is close to 33.3% because of the load balancing of the Hadoop cluster. Furthermore, the probability of malicious behaviors is 20%. Therefore, the additional time is mainly due to Tasks waiting for the values returned by the extra speculative TaskAttempts. The time cost of SHIYF increases by 16% to 20% compared with that of the original condition.

Malicious Node Detection Ratio of SHIYF
In this section, we verify the malicious node detection ratio of SHIYF. In SHIYF, MRAppMaster will record time, Job_ID, the malicious node's hostname, and right and wrong MD5 hashes in Job logs on HDFS.
All the test files were divided into 240 blocks; with the addition of 30% speculative executions, every NM processed 52 blocks in Map. Then, "hadoop2" and "hadoop5" were set to execute malicious behaviors in Map and Reduce with 20% probability, which amounts to approximately 22-30 malicious actions. In 25 WordCount experiments, 20-30 malicious behavior records were found in the logs. The experiment results and the computation of the malicious node detection ratio are shown in Table 1. A task fails when the three MD5 hashes returned by different TaskAttempts are inconsistent, so that both comparisons are unsuccessful; in that case, the Job launches the same Task again. Furthermore, all the hostnames, comparisons, and MD5 hashes in SHIYF are recorded, and we can obtain three conclusions as follows:

2. The malicious node detection ratio of SHIYF is between 87% and 93.3%. This ratio is in line with the expected theoretical derivation shown in Figure 9 in Section 3. Therefore, "hadoop2"/"hadoop5" are not the malicious NMs; they executed the malicious behaviors in their container tasks only once, and this instance was not chosen as among the verified malicious behaviors.
3. On the basis of the conclusions of the theoretical derivations in Section 3, the detection ratio increases with the execution ratio $E_r$, the number of jobs $t$, and the malicious action probability $P$. Consequently, we believe that SHIYF will achieve a better malicious node detection ratio when it runs on a larger cluster and test data set.
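The per-node block count quoted in this section can be reproduced with simple arithmetic. The sketch below assumes that Map attempts are spread evenly over the six NMs and covers the Map phase only; Reduce attempts on the malicious nodes would add to the expected number of malicious actions. Variable names are illustrative.

```python
from fractions import Fraction

blocks = 240                      # Map tasks per job
er = Fraction(30, 100)            # speculative execution ratio
nms = 6                           # NodeManagers in the cluster
malicious_nms = 2                 # "hadoop2" and "hadoop5"
p = Fraction(20, 100)             # probability of a malicious action

map_attempts = blocks * (1 + er)              # originals plus duplicates
per_nm = map_attempts / nms                   # ASSUMPTION: even load balancing
expected_map_actions = malicious_nms * per_nm * p

print(int(per_nm), float(expected_map_actions))
```

Exact fractions are used so that the result does not depend on floating-point rounding.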

Resource Utilization of SHIYF
In this section, we monitor the resource consumption of every machine in SHIYF, including CPU utilization, memory utilization, disk throughput, and network throughput. With "10 × 6 G" taken as the example, Figures 12-14 show the resource consumption of SHIYF on the RM and NMs in the three scenarios. NMs are divided into two types: one includes MRAppMaster and Containers, and the other includes Containers only.



ResourceManager
In RM, ApplicationManager launches one MRAppMaster to control the Job and Scheduler that is responsible for the communication with the NMs.
We can obtain the following conclusions from the analysis of Figure 12.

1. The addition of 30% extra speculative executions and the MD5 hash computations and comparisons have a weak influence on RM. Figure 12a shows that the CPU utilization of RM in the WordCount experiment is relatively low except for the initial stage.
2. Adding 30% speculative tasks and 33.3% malicious tasks merely adds a few status monitors to NMs and information communications between RM and NMs; memory utilization remains lower than 36%. Moreover, the memory utilization of RM is markedly smooth, as shown in Figure 12b.
3. Several reference variables are recorded to show the disk influence of SHIYF on RM, including the number of transfers per second "tps," sectors read/written per second "rd_sec/wr_sec," the average size (in sectors) of the requests that were issued to the device "avgrq-sz," the average queue length of the requests that were issued to the device "avgqu-sz," and so on. We take the most representative parameter "wr_sec/s" as an example. Figure 12c shows that adding 30% speculative tasks and MD5 comparisons has a weak influence on the disk throughput of RM. The primary influences are found in the initial and final phases because more statuses of NMs are transmitted to RM; thus, SHIYF evidently increases the hard disk writing of RM.
4. Total number of packets received per second "rxpck/s," total number of packets sent per second "txpck/s," and data size received per second "rxkB/s," among others, are recorded for monitoring the network throughput. Taking "rxpck/s" as an example, Figure 12d shows that adding 30% speculative tasks and 33.3% malicious nodes has a minimal influence on the network throughput of RM. Only the repeated computing and comparison of MD5 hashes in SHIYF add some resource applications and NM status reports.

NodeManager: NM (MRAppMaster)
NMs are divided into two categories: the first includes MRAppMaster and Containers, and the second includes only Containers. Figure 13 shows the resource utilization of NM (MRAppMaster).
We can obtain the following conclusions from the analysis of Figure 13.

1. The CPU utilization of NM (MRAppMaster) is shown in Figure 13a. In SHIYF, the lowest CPU occupancy is more than 80%; moreover, the time consumption of the Job is longer than that in the original YARN. However, the increases are under 20%, and a lower CPU utilization would be expected if SHIYF were built on more powerful clusters.
2. In the three conditions, the memory utilization of NM (MRAppMaster) differs only slightly, as shown in Figure 13b.
4. The total number of packets transmitted per second (txpck/s) indicates the influence of SHIYF on the network throughput of NM (MRAppMaster), as shown in Figure 13d. SHIYF adds some network communications between NM (MRAppMaster) and RM and other NMs when 30% speculative tasks and 33.3% malicious nodes are added, because NM (MRAppMaster) must report more node statuses to RM and communicate with more containers. However, the extra overhead is affordable.

NodeManager: NM (Containers)
Figure 14 shows the resource use of the other class of NMs in SHIYF.
We can obtain the following conclusions from the analysis of Figure 14:
1. The CPU utilization of NM (Containers) is shown in Figure 14a. Compared with the CPU utilization shown in Figure 13a, it is lower than that of NM (MRAppMaster) in the three conditions because NM (MRAppMaster) needs to manage NM, MRAppMaster, and all other containers in the job.
2. The memory utilization of NM (Containers) is also lower than that of NM (MRAppMaster), as shown in Figure 14b. Both Figures 13b and 14b show that SHIYF has little effect on the memory utilization of NMs.
3. Figure 14c shows the number of sectors read from the device per second (rd_sec/s) in NM (Containers). Compared with Figure 13c, the disk throughput of NM (Containers) peaks earlier than that of NM (MRAppMaster), and the average throughput is higher. This situation shows that the machine on which MRAppMaster runs was allocated fewer containers by the dynamic load balancing of the Hadoop cluster. However, the effect of SHIYF is weak in the three conditions.
4. A comparison of Figures 13d and 14d shows that NM (Containers) also needs to report more node statuses to RM and communicate more with NM (MRAppMaster) in SHIYF. Nevertheless, the resource consumption of NM (Containers) is lower than that of NM (MRAppMaster), and the overhead of both is affordable.
Finally, we can draw three conclusions from the WordCount benchmark.
1. SHIYF can locate the malicious nodes and the potentially malicious ones. The malicious node detection ratio is between 87% and 93.3%, which is in line with the expected theoretical derivation.
2. The time cost of SHIYF increases by 16% to 20%. Moreover, SHIYF adds little resource overhead.
3. The limited computing ability of the experiment hardware may increase the time cost and resource consumption. We expect SHIYF to perform better when it executes a much larger range of jobs in a more powerful Hadoop cluster. If so, SHIYF can use a lower speculative execution ratio to achieve high malicious node detection ratios.

Execution Results of SHIYF
In this section, we use TestDFSIO to test the read-and-write file system performance of SHIYF. The intermediate results of TestDFSIO, including "tasks," "size," "time," "rate," and "sqrate," differ between TaskAttempts and therefore produce inconsistent MD5 hashes. Considering this particularity, we configure SHIYF for the three scenarios as follows.

• In the original condition, we test the performance of the read-and-write file system of YARN without any modification.
• In the SHIYF (30% duplicate) condition, we temporarily disable the MD5 comparison of SHIYF because MD5 is a simple and efficient digital digest algorithm, and no malicious nodes occur in this condition. Moreover, SHIYF has little impact on the total job execution time, as verified in Section 4.1.
• In the SHIYF (33.3% malicious) condition, every task chosen for checking needs to compute and compare the MD5 hashes of the intermediate or final results. However, the result of every TaskAttempt is different. Thus, we keep only "tasks" and "size" as the Map/Reduce results to ensure that the MD5 hashes of the same TaskAttempts that ran on secure NMs are equal.
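The third scenario above depends on hashing only the fields that are identical across honest attempts. A minimal sketch of that idea follows; the record layout, the field values, and the function names are hypothetical and only illustrate why timing-dependent fields must be left out of the digest.

```python
import hashlib

STABLE_FIELDS = ("tasks", "size")   # fields kept for comparison


def stable_digest(result: dict) -> str:
    """MD5 over the run-independent fields only."""
    payload = "\n".join(f"{key}={result[key]}" for key in STABLE_FIELDS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def full_digest(result: dict) -> str:
    """MD5 over every field, including timing-dependent ones."""
    payload = "\n".join(f"{key}={result[key]}" for key in sorted(result))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Two honest attempts of the same task: same work, different timings.
attempt_a = {"tasks": 1, "size": 1073741824, "time": 5120, "rate": 204.8, "sqrate": 41943.04}
attempt_b = {"tasks": 1, "size": 1073741824, "time": 5377, "rate": 195.0, "sqrate": 38025.0}

print(stable_digest(attempt_a) == stable_digest(attempt_b))
print(full_digest(attempt_a) == full_digest(attempt_b))
```

Restricting the digest this way trades some coverage for determinism: a node that tampers only with the excluded fields would not be caught by the comparison.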
In addition, TestDFSIO launches a MapReduce job to read or write files. The same amount of data is written into or read from HDFS, and four statistics are collected: throughput (mb/sec), average I/O rate (mb/sec), I/O rate std deviation, and test exec time (sec). We executed TestDFSIO 25 times in three conditions. The average values of the experiments are shown in Table 2.

1. In the three conditions, the read speed is much faster than the write speed. In the beginning, with the increase in file size, the running time decreased, thereby indicating that HDFS was highly suitable for processing large-scale reading and writing data. However, a corresponding increase in average running time occurred along with the increase in file sizes because of the inevitable increase in the number of cluster nodes, the complicated hardware configurations, and other reasons.
2. Increasing speculative executions by 30% corresponds to 30% more TaskAttempts. However, the speculative executions are launched simultaneously with the original TaskAttempts; thus, the time consumption increase is under 8%, as shown in Table 3. This increase is mainly caused by the inconsistency of the TaskAttempt completion times in both the "WRITE" test and the "READ" test.
3. In theory, adding 33.3% malicious nodes that execute malicious behaviors with 20% probability is equivalent to an increase of approximately 6.66% extra speculative executions. The time increase occurs primarily because of the waiting time for the second speculative execution. Moreover, the increase in the total time cost does not exceed 16% compared with the original YARN.

Considering location optimization in HDFS, most data are read from the local disk rather than over the network with its limited bandwidth. Therefore, the read speed is faster than the write speed. The influence of SHIYF on network throughput was computed based on the write throughput (mb/sec) in TestDFSIO as follows:
Step 1. Let $N_t$ be the total number of nodes in the SHIYF cluster and $D_f$ the total size of the test files (M). Let $T_s$ and $T_{hr}$ represent the test execution time (sec) and throughput (mb/sec), respectively.
Step 2. Let $N_f$ be the number of files. Each concurrent process handles one file in MRv2, so the number of concurrent processes is $N_p = N_f$.
Step 3. Let $N_r$ denote "dfs.replication"; here $N_r = 2$. Thus, $(N_r - 1) = 1$ network transmission occurs when one file is written to HDFS. $N_{wpm}$ denotes the total number of write processes in every NM.
Step 4. From these quantities we obtain the network throughput $NT_{hr}$.
Combined with Table 3, we can calculate the network throughput in the three conditions, as shown in Table 4. Two conclusions can be drawn as follows:
1. More files written on HDFS correspond to more copied files transmitted on the network. Therefore, the network throughput is higher.
2. The highest network throughput is 64.591 M/s, and the highest growth rate of network throughput is 28.28%. However, this value is far below the bandwidth of a gigabit network (128 M/s). Thus, no significant bandwidth load occurs. Instead, the process improves network bandwidth utilization.
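To put the peak figure in proportion, the share of the link that it occupies can be computed directly. The sketch below uses only the two values quoted above; the variable names are illustrative.

```python
peak_network_mb_s = 64.591    # highest network throughput reported above
link_capacity_mb_s = 128.0    # gigabit bandwidth figure used in the text

utilization = peak_network_mb_s / link_capacity_mb_s
headroom_mb_s = link_capacity_mb_s - peak_network_mb_s

print(f"utilization: {utilization:.1%}, headroom: {headroom_mb_s:.3f} M/s")
```

Roughly half of the link remains unused at the peak, which is consistent with the statement that no significant bandwidth load occurs.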

Influence of SHIYF to HDFS
To examine the influence of SHIYF on HDFS, we chose "1 × 60 G" as the example because its execution time is the longest in the TestDFSIO benchmark. The number of sectors read from/written to NM (Containers) per second (rd_sec/s, wr_sec/s) can intuitively demonstrate the influence of SHIYF on HDFS, as shown in Figure 15. We can obtain three conclusions as follows.

1. More speculative executions correspond to more data read from or written to HDFS. However, the changes in the curves in the three conditions were minimal; moreover, they interlaced and partially overlapped.
2. Although SHIYF improves the use and efficiency of the disk, it does not increase the hard disk load. The minimal change follows the ideal states in the three conditions.
3. SHIYF impacts the time of TestDFSIO. However, it has no effect on the read and write performance of HDFS.
In this section, we test SHIYF in the TestDFSIO benchmark and show the influence of SHIYF on network throughput and HDFS. Although SHIYF strengthens the security and integrity of YARN using the speculative executions and MD5 algorithm, it can also maintain the Input/Output (I/O) performance of HDFS. The slight growth of network throughput and time mainly results from the increasing speculative executions and the extra waiting time. Moreover, the overhead is affordable.

Execution Results of SHIYF
MRBench repeats a small job many times, as specified by the user, to check whether small jobs running on a Hadoop cluster are repeatable and efficient. Here, MRBench is used to test both the performance of SHIYF in handling many small jobs and its security protection ability. Therefore, the number of job repetitions $t$ is set to 10, 15, 20, 25, 30, and 40, based on Section 3. Several parameters should be set as follows:
• inputLines = 1000. Every generated file contains 1000 lines.
The experiment results in the three scenarios are shown in Figure 16. We obtain three conclusions.

1. In these three conditions, every experiment with the same configuration is executed with different repetition times. The execution time in Figure 16 is the average over the repetitions of the same job in SHIYF. More repetitions correspond to a more accurate execution time.
2. Adding 30% speculative executions increases the MRBench time by approximately 9%, mainly due to the inconsistent completion time of TaskAttempts. Moreover, this process adds MD5 hash computations and comparisons.
3. In the 33.3% malicious nodes condition, execution time increases by approximately 16% because of the extra speculative TaskAttempts and the mismatch between the two compared MD5 hashes.

Malicious Node Location of SHIYF
MRBench is also used to test the security protection ability of SHIYF in locating malicious nodes. We chose the experiment "t = 25" as the example following Section 4.3.1. The parameters "maps = 200" and "reduces = 100" mean that 200 + 100 = 300 tasks are used in every MRBench benchmark. Hadoop2 and hadoop5 are two malicious nodes in the Hadoop cluster, and their probability of exhibiting malicious behaviors is 20%. Therefore, the upper limit of malicious tasks executed by hadoop2/hadoop5 is approximately 300/6 × 20% = 10 in MRBench. For MRBench to execute successfully and achieve the goal of locating the malicious nodes, the upper limit at which CA withdraws all resource applications of the malicious nodes must be raised to 15 (>10) rather than the value of 5 used in Section 2.2.1.
After 25 MRBench experiments were performed, we checked the Job logs and computed the average values, as shown in Table 5. We obtain four conclusions.

1. No malicious action record about hadoop1 is found in the Job logs. Thus, it is a secure NM.
2. The average value of hadoop3/hadoop6 records is between 0 and 1, mainly because two consecutive failed MD5 hash verifications would be recorded in 25 experiments. Therefore, they might be potentially malicious NMs. Although they were secure NMs, they were considered potentially malicious if they verified a result at the same time as the malicious NMs and the results were inconsistent.
3. The average value of hadoop2/hadoop5 malicious behaviors is 9. We can judge them to be the malicious NMs in the Hadoop cluster. Therefore, the malicious node detection ratio of SHIYF is at least 90% in the MRBench benchmark.
4. Not only can SHIYF achieve a high malicious node detection ratio, but it can also locate the malicious nodes and the potential ones accurately.
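The 90% figure follows from the upper limit of malicious tasks per node and the average number of recorded malicious behaviors. The sketch below reproduces that arithmetic, assuming that tasks are spread evenly over the six NMs as in the estimate above; variable names are illustrative.

```python
from fractions import Fraction

maps, reduces = 200, 100
nms = 6
p = Fraction(20, 100)              # probability of a malicious action

tasks = maps + reduces                         # tasks per MRBench run
upper_limit = Fraction(tasks, nms) * p         # malicious tasks per malicious NM
recorded = 9                                   # average records for hadoop2/hadoop5
detection_ratio = recorded / upper_limit

print(int(upper_limit), float(detection_ratio))
```

Because the denominator is an upper limit, the resulting ratio is a lower bound, which matches the "at least 90%" wording.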

Conclusions and Future Work
SHIYF is proposed in this paper. Through theoretical derivation, we set the relevant parameters of SHIYF accurately and implemented the prototype framework SHIYF based on Hadoop 2.8.0. The framework advantage of speculative execution and MD5 hash verification is that they ensure the integrity and validity of MapReduce 2.0 results. Moreover, SHIYF is able to locate the malicious and potentially malicious nodes in the Hadoop cluster.
Three experiments on SHIYF demonstrate that its malicious node detection ratio $D_{ratio}$ and resource consumption achieve the expected goals. In particular, $D_{ratio}$ is at least 87%, while the overhead increases only slightly. Therefore, we expect SHIYF to achieve a desirable $D_{ratio}$ with a lower speculative execution ratio and fewer resources when it runs on a more powerful machine cluster and processes more jobs.
However, adding 30% speculative tasks in SHIYF still wastes some resources. We will work to reduce the ratio of speculative tasks and improve $D_{ratio}$. Using a 15% speculative execution ratio to achieve a $D_{ratio}$ of more than 95% would be a much better tradeoff between resource usage and security, and it would also improve the efficiency of SHIYF. In addition, only non-collusive malicious nodes were considered in the experiment environment. If several collusive attackers exist in the Hadoop cluster, they might return the same wrong MD5 hashes and be incorrectly considered secure nodes. Therefore, our future research will focus on improving the tradeoff between performance and security in SHIYF and on defending against collusive malicious nodes.

Figure A4. SHIYF NodeManager implementation.