A Federated Learning Multi-Task Scheduling Mechanism Based on Trusted Computing Sandbox

At present, some studies have combined federated learning with blockchain, so that participants can conduct federated learning tasks under decentralized conditions, sharing and aggregating model parameters. However, these schemes do not take into account the trusted supervision of federated learning and the case of malicious node attacks. This paper introduces the concept of a trusted computing sandbox to solve this problem. A federated learning multi-task scheduling mechanism based on a trusted computing sandbox is designed and a decentralized trusted computing sandbox composed of computing resources provided by each participant is constructed as a state channel. The training process of the model is carried out in the channel and the malicious behavior is supervised by the smart contract, ensuring the data privacy of the participant node and the reliability of the calculation during the training process. In addition, considering the resource heterogeneity of participant nodes, the deep reinforcement learning method was used in this paper to solve the resource scheduling optimization problem in the process of constructing the state channel. The proposed algorithm aims to minimize the completion time of the system and improve the efficiency of the system while meeting the requirements of tasks on service quality as much as possible. Experimental results show that the proposed algorithm has better performance than the traditional heuristic algorithm and meta-heuristic algorithm.


Introduction
In the past decade, the global data flow has been growing at an unprecedented speed, and the value behind the data has been paid more and more attention. Intelligent applications supported by massive data will further promote the coordinated development of power enterprises and financial services in the industrial chain. However, due to the problem of data isolation between different departments and systems, the cost of crossdomain data circulation is high, and there is a risk of privacy disclosure, which hinders the full release of the potential value of data.
Traditional centralized financial services in the industrial chain generally have problems such as high cost of verification, incomplete information, difficulty in the supervision of repeated financing, fake financing, and increased financing costs. As a distributed ledger technology [1], blockchain has emerged as a solution to the problem of secure data sharing, providing participants with high-quality data and secure data sharing. At present, some researchers combine industry chain finance with blockchain and integrate blockchain technology into complex business scenarios of industry chain finance. This also brings the problem of the security of the industrial chain financial data privacy in the multi-party data-sharing environment. Although blockchain-based multi-party data sharing can make the data open and transparent, it cannot guarantee the data privacy security of users. How to protect data privacy in the case of data sharing is a hot topic in current research.
Federated learning (FL), as a new data value-sharing method, aims to solve the problem of data isolation and protect data privacy. Different from traditional machine learning, participants in federated learning do not need to upload their local data but only need to upload parameters and models trained with local data to the central server, and then, the central server will aggregate the global model, thus protecting the privacy and security of their data.
Traditional federated learning relies on a single central server. If this weak central server fails or is attacked, the global model will be inaccurate and even the whole federated learning process will be hindered. In addition, the inability to guarantee whether the central server itself has malicious behavior means that the data privacy and security of each participant will not be guaranteed. The combination of blockchain and federated learning can well protect the privacy of user data and share data value in a decentralized scenario. There have been many studies that use the characteristics of blockchain to replace the central server. At the same time, the nodes with good performance are provided with corresponding rewards; thus, the nodes with better data are encouraged to participate in the training more actively. These mechanisms and schemes use blockchain technology to verify and supervise the original data and final calculated results of federated learning. Although this approach can share the value of data and protect the privacy of users to a certain extent by keeping the data local, it does not take into account whether the computation is credible in the training process of the federated learning model and the privacy leakage caused by malicious node attacks. It ignores the trust supervision in the training and aggregation process of the federated learning model. Therefore, it is a challenging task to build a trusted computing framework that ensures data privacy and security for all participants.
In this paper, to solve this problem, we design a federal learning training supervision mechanism based on state channels by introducing the concept of a computational sandbox in trusted computing. The major contributions to this work are as follows: • A trusted regulatory framework for federated learning training transactions based on blockchain state channel is designed, and a decentralized trusted computing sandbox composed of computing resources provided by each participant is constructed. The model training process is carried out in the state channel, and the malicious behavior is supervised by a smart contract. • Aiming at the resource heterogeneity problem of the federated learning participant node, a participant resource management method was proposed to model the resource scheduling optimization problem existing in the process of constructing the state channel. The deep reinforcement learning (DRL) method was used to solve the proposed optimization problem, minimizing the maximum completion time of the system under the condition of meeting the requirements of tasks on service quality. • We compare the resource-scheduling algorithm based on DRL with the traditional heuristic and meta-heuristic algorithms through simulation experiments. The experiments show that the algorithm has better performance in meeting the requirements of tasks on service quality and reducing the maximum completion time of the system.
The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 introduces the system framework and workflow of the mechanism. In Section 4, the resource-scheduling problem in the process of constructing a state channel is modeled. Section 5 presents the resource scheduling algorithm based on DRL. In Section 6, the proposed algorithm is verified by experimental simulation. Section 7 summarizes the work of the thesis.

Decentralized Federated Learning Framework
Federated learning has become increasingly important for modern machine learning in data privacy-sensitive scenarios. The existing federated learning mainly adopts the network topology based on a central server [2][3][4][5]. However, in some cases, this connection method is not suitable. For example, there is no central server connecting all users; the communication cost to the central server is unbearable; the central server cannot be fully trusted.
Thus, to further protect data privacy and avoid communication bottlenecks, decentralized architectures have been proposed recently [6][7][8][9][10][11][12]. Decentralized architectures remove a central node and each node only communicates with its neighbors (trust each other) by exchanging its local model. Ref. [13] provides a systematic analysis of decentralized learning.
Blockchain is one of the most popular disruptive technologies [14]. With the characteristics of decentralization and data not easily tampered with, it can provide a high degree of guarantee for secure data collection and sharing, paving the way for emerging financial and industrial services.
Many studies have combined blockchain technology with federated learning to ensure the value of sharing data in the context of decentralization and data privacy. Ref. [15] designed a distributed multi-party secure data-sharing architecture based on blockchain authorization to transform the data-sharing problem into a machine learning problem. By sharing the data model instead of revealing the actual data, data privacy was well maintained. To balance the issues of private security and efficiency in fog computing, ref. [16] proposes a new federated learning (FL-Block) scheme based on blockchain, which allows local devices to exchange and update the global model through blockchain. Ref. [17] designed a federated learning system under the blockchain scenario, which uses the reputation mechanism to assist home appliance manufacturers to train machine learning models and predict future consumer demand and consumption behavior. However, these schemes do not guarantee the reliability and security of the training process. The trained model and parameters will be sent to other nodes for aggregation. If other nodes have malicious behaviors, personal privacy information can still be obtained by attacking FL model parameters.

Resource Scheduling Problem
The existing methods for solving scheduling problems include dynamic programming, probabilistic algorithm, heuristic algorithm, meta-heuristic algorithm, hybrid algorithm, and deep reinforcement learning methods.
The resource-scheduling problem is often abstracted as a target optimization problem. Traditional heuristic resource scheduling algorithms include first-come-first-serve FCFS [18], RR [19], Min-min [20], Max-min [20], etc. Metaheuristic algorithms are algorithms inspired by biological behavior and natural phenomena that imitate biological behavior, including genetic algorithm [21], particle swarm optimization algorithm [22], ant colony algorithm [23], etc. Genetic algorithm [21] applies the principle of biological evolution to obtain high-quality solutions from the search space in polynomial time, and it generates new solutions by randomly modifying better solutions. Ref. [24] proposes a hybridized monarch butterfly optimization algorithm, which is suitable for solving cloudscheduling problems. Ref. [25] combines two optimization algorithms, namely CS (cuckoo search) and PSO (particle swarm optimization search), to reduce the completion time, cost, and deadline default rate.
In practical application, the cloud system has the following characteristics: (1) the system is large and complex, and cannot be modeled accurately; (2) the timeliness of scheduling decisions requires a high-speed scheduling algorithm; (3) randomness of tasks (or requests), including randomness of task number, arrival time and size. These characteristics pose a challenge to resource-scheduling research in cloud computing. It is difficult for a particular meta-heuristic or heuristic algorithm to fully adapt to real dynamic cloud computing systems or edge cloud computing systems. Deep reinforcement learning (DRL) is a new method of machine learning, which combines the advantages of deep neural networks and reinforcement learning and has been used to solve resource-scheduling problems of cloud computing in recent years. It has been proved to have strong advantages in many scenarios, especially in complex scenarios of cloud computing [26][27][28][29][30][31]. Ref. [32] proposed a task-scheduling framework based on Q learning. All requests are prioritized and then tasks are assigned to the virtual machine using a constantly updated policy that minimizes task response time and maximizes CPU utilization. Ref. [33] established a model based on a reinforcement learning K-mean algorithm and developed a physical resource allocation scheme to meet the service quality requirements of users. Ref. [26] proposes a deep reinforcement learning solution based on DRL to effectively solve different cloud resource management problems. The above scheme provides the model and algorithm of the network resource scheduling problem but does not consider the characteristics of federated learning and the training environment.
Aiming at the problems of model attacks of malicious nodes in the federated learning process and whether the computing process is safe and reliable, this paper introduces the concept of computing sandbox to establish the state channel in the federated learning process, ensuring the security and trust of computing in the process of model training and the supervision of malicious behaviors. To solve the container-based resource management problem under this framework, we also propose a resource scheduling strategy for federated learning tasks based on the requirements of tasks on service quality and system completion time, improving the resource utilization rate of network collaborative federated learning and improving the efficiency of the system to complete federated learning tasks.

System Framework Structure
The system architecture constructed in this paper is shown in Figure 1. By introducing the concept of computational sandbox in trusted computing, a federated learning framework based on state channels is constructed in the blockchain scenario. The model parameter passing and aggregation of federated learning tasks are regulated in the computational sandbox state channel, and the occupied resources are released after the task is completed. The result is then credibly transmitted to the requester. This process supports the trusted calculation and sharing of data, which is invisible to the training participants. The whole system framework is mainly divided into three parts: participant nodes, blockchain and state channel.
The participant nodes can be either the requester or the local training node of the federated learning task. Each participant has user data and resources with training value, representing a certain industry or enterprise, such as financial institutions, credit investigation departments, etc. Participant nodes do not want to share user data with other parties but rather want to share the value of data through federated learning by co-training the global model with local data. The participant node can apply to the blockchain for registration as a training node, and it can apply to cooperate with other participant nodes to conduct federated learning tasks and train the global model.
In the traditional federated learning process, locally trained model parameters need to be sent to the central server or aggregation node, where the global model can be aggregated. In this case, the single point of attack on the aggregation node can easily cause the collapse of the entire federated learning system, local data privacy leakage, and other security problems. To this end, we introduce a computing sandbox as a blockchain state channel. The global model aggregation of federated learning tasks and the calculation of parameters depend on the computing resources and space provided by each participant. We perform the computation in a state channel, which is regulated by the blockchain. In the process of constructing the state channel, the blockchain reasonably schedules resources to the federated learning task according to the requirements of different tasks, making the whole system work more efficiently. The scheduling strategy determines the efficiency and resource utilization of the whole system. Federated learning tasks do not own or retain allocated resources. Instead, the system dynamically allocates them based on current needs using scheduling algorithms to make full use of resources and release used resources immediately after the task is completed. The resource-scheduling problem and scheduling policy are discussed in Sections 4 and 5.
Blockchain is used to manage the status and verification information of each participant, integrate and virtualize various resources registered by each participant, such as computing resources and storage space, and coordinate and supervise the federated learning of each participant through smart contracts. Considering that different participant nodes have different computing power and storage resources, some participant nodes do not have the ability to complete the model training task independently. By integrating the resources of all nodes and constructing state channels with reasonable allocation and scheduling, the problem of resource heterogeneity can be well coordinated and solved. When the task request arrives, the blockchain will conduct resource scheduling, reasonably allocate resources to the task, and establish the federal learning state channel. After that, the training of the federated learning task model and the transmission of parameters will be carried out in the state channel. Any malicious behavior will be supervised by all nodes and reported to the blockchain to ensure the reliable calculation of data and the security of the model training aggregation process.
To protect the privacy and security of user data, the real physical address of the state channel should not be disclosed to all participant nodes. The blockchain is responsible for masking the real physical address of the state channel from the participant nodes and only providing the virtual address and interface to the participant nodes. The federated learning task is completed in the state channel, and each participant node cannot obtain the model and parameters of other participant nodes, which means the participant node cannot attack the original user data.

System Work Flow
As shown in Figure 2, the system workflow is mainly divided into the following steps. In the first step, a participant who wants to instantiate a federated learning task to aggregate the global model and share the value of the data must first register with the blockchain as a training node. Participants submit identity information and describe the data they own and the resources they can provide, including computing unit operation rate, storage space size, cost, etc., for allocation and scheduling by the blockchain.
In the second step, the blockchain authenticates the applied node and integrates the resources provided by the node to establish a virtual resource pool. The blockchain also maintains a virtual resource status table, which records the allocation and usage of resources, the mapping of virtual ports to physical addresses, and so on.
In the third and fourth steps, the participant node applies to instantiate a federated learning training task, and the blockchain selects the training node according to the data description information provided during the registration of each training node.
In the 5th step, each participant node negotiates the relevant parameters of the federated learning training task, including the number of training rounds, the initialization model, the total budget cost, etc., and uses digital signature technology to sign a smart contract.
In the 6th step, the smart contract of blockchain instantiates a federated learning task, determines the initialization priority of the task, saves its initial state on the chain, and places it in the task queue to wait for the allocation and scheduling of computing resources.
In the 7th step, the blockchain reasonably allocates the federated learning tasks in the waiting state to different virtual computing units. The blockchain obtains the physical address of the virtual resource by searching the resource state table and then encrypts it, establishing the trusted computing sandbox as the state channel and updating the resource usage state. The participant node only knows the information of the allocated virtual resources, and it cannot obtain the real physical address of the computational sandbox, so it cannot carry out malicious attacks on the model in training.
In the 8th to 12th steps, participants perform specific federated learning training. The smart contract initializes the model and sends it to each training node. Then, the nodes update the local model with local data and upload the gradient and other parameter information to the state channel for global model aggregation. After the aggregation is completed, the updated global model is sent to each training node for a new round of training until the training is completed. In the training process, the behavior and status of each node are supervised by all nodes. Once malicious behavior is found, it is immediately reported to the blockchain and the malicious node is punished accordingly.
In the 13th to 15th steps, after completing the federated learning training task, the training node updates the final results and status information to the blockchain and then immediately releases the occupied resources, closing the state channel.
In the 16th and 17th steps, the blockchain sends the final result to the task requester, updates the status information of virtual resources, and waits for the arrival of new tasks for resource allocation and scheduling.
In the following sections, we will focus on resource scheduling during the establishment of the state channel and discuss how to allocate resources to make the whole system work more efficiently while satisfying the requirements of tasks on service quality as much as possible.

Resource Scheduling Problem Modeling
In the process of constructing state channels for federated learning tasks, different resource scheduling schemes may affect the working efficiency of the system and the quality of completion of tasks. Improper scheduling policies may cause the computing load of some nodes to be too heavy and cause faults, while the resource utilization of other nodes decreases, resulting in resource waste and greatly reducing system efficiency. There may also be unreasonable resource allocation schemes that make it difficult to meet the requirements of tasks on service quality, such as exceeding the budget and the expected completion time. In this section, a model of the resource-scheduling optimization problem under the state channel-based federated learning training supervision mechanism is constructed.
In this paper, the resource scheduling problem is defined as how to assign multiple federated tasks to multiple computing nodes, such as abnormal behavior detection task, risk assessment task, customer behavior analysis task, product intelligence recommendation task, etc., so as to obtain a scheduling scheme that minimizes the completion time of the whole system under the constraints of task cost and task completion time.
We assume that the training node set of the participants is denoted as Nodes = {N 1 , N 2 , . . . , N N node }, where N i indicates the node i and N node indicates the number of nodes. Nodes are represented as N i = {PU 1 , PU 2 , . . . , PU N i pu }, where PU j (j = 1, 2, . . . , N pu ) represents the j-th processing unit, and N i pu represents the number of processing units of the i-th node. The processing unit is represented as PU = {SIDP, E, COST, TD}, where SIDP is the number of the processing unit, E is the execution capacity of the processing unit, COST is the cost that needs to be paid for using the processing unit to execute the unit time, and TD is the communication delay between the node where PU is located and other nodes. TD i,j indicates the communication delay from node i to node j.
We assume that the federated learning task set is expressed as Tasks = {T 1 , T 2 , . . . , T N task }, where T i represents task i and N task represents the number of tasks. Tasks are represented as T = {SIDT, WorkLoad, MaxT, MaxC, PI, N train }, where SIDT is the serial number of the learning task, WorkLoad is on behalf of its quota, MaxT represents the maximum completion time a task can bear, and MaxC represents the maximum cost a task can bear. In this paper, MaxT and MaxC are taken as indicators of the requirements of tasks on service quality, PI represents the priority of the task, and N train represents the set of training nodes participating in the task. Therefore, we can use a matrix to represent the task resource allocation scheme. The allocation matrix is defined as follows: where X is the task resource allocation matrix with the size of N task * N pu , and x i,j represents the assignment of the i-th task on the j-th block processing unit, which is defined as According to the task resource allocation matrix, we can calculate the completion time of each task. The calculation formula is as follows: Similarly, the cost of each task can be calculated by the following formula: We can also calculate the maximum completion time of the system by the following formula: Finally, resource scheduling is abstracted as a goal optimization problem, that is, solving the resource allocation matrix X meets the following conditions: The specific system parameters in this section are shown in Table 1. Communication delay from node i to node j Tasks Federated learning task set N task The number of tasks SIDT The serial number of the task

WorkLoad
The workload of a task, expressed as the number of instructions

MaxT
The maximum completion time the task can tolerate MaxC The maximum cost that the task can bear PI The priority of the task N train Set of training nodes for the task The distribution of the ith task on the jth processing unit ECT i Completion time of task i TCost i The cost of task i maxMakespan Maximum system completion time

Algorithm Framework and Mechanism
In the complex federated learning training environment based on blockchain, the resource-scheduling algorithm needs to select the optimal resource allocation scheme according to the current node resource state and the federated learning task waiting for resource allocation. In this paper, an Actor-Critic resource-scheduling algorithm based on DRL has been designed. The algorithm framework is shown in Figure 3, which is mainly divided into two parts, environment and agent.
The environment is responsible for monitoring the state of each node, managing the computing resources they own, and maintaining a list of tasks awaiting resource allocation. The environment provides the current state information to the agent, including the state of the current resource and the state of the task to be assigned, and it returns the reward of the current operation and the state of the next time according to the actions made by the agent.
The agent is a resource scheduler, making decisions and choosing actions based on the state given by the environment. The system allocates processing resources to tasks based on actions. The agent is composed of the Critic value network and the Actor policy network. The Critic network can score and evaluate the current state, while the Actor network selects the current optimal action according to the environment state. Traditional Q learning has some disadvantages such as a waste of previous experience and correlation of parameter updating. This paper adopted the method of experience replay, set the experience buffer pool R, and stored the environmental state, the action performed, the reward received, and the state of the next moment as experience samples in the buffer pool. When training network parameters, small batch data can be randomly extracted for training and parameter update so as to make better use of the previous experience and break the correlation of parameter update.
Before resource scheduling, the system sorts the tasks waiting for resource allocation based on their priorities and then allocates resources to them. Suppose that the priority of each task is denoted as Priorities = {PI 1 , PI 2 , . . . , PI N task }, where PI i represents the priority of task i and it is represented as PI i = initialPr i + waitTime i , where initialPr i is the initial priority of the task and related to the WorkLoad i . The lower the workload, the higher the priority, which takes advantage of the short-job-first strategy and helps to reduce the average wait time. waitTime i is the time the ith task is waiting for resource allocation, which is a dynamic value. The longer the waiting time, the higher the priority, which is conducive to reducing the problem of operation hunger.
When training the Actor-Critic algorithm network, the system makes an action a t based on the policy network according to the current environment state s t . Then, the reward r t is obtained by interacting with the environment, and the new environment state s t+1 is obtained, and then, (s t , a t , r t , s t+1 ) is put into the buffer pool R as a sample. Finally, N groups of data samples were randomly selected from the buffer pool, and the Actor-Critic network parameters were updated by calculating the average value of the gradient.

Train Critic Value Network
Define the discount return U t as follows: U t = r t + γr t+1 + γ 2 r t+2 + · · · + γ n r t+n (7) where U t represents the cumulative reward since time t, r t represents the reward at time t, γ ∈ (0, 1) is the parameter, and the subsequent reward should have a lower weight than the current reward. In order to judge whether an action is good or bad in the current state, the action value function is defined as follows: where Q π (s t , a t ) represents the expectation of discount return U t , and s t and a t are the state at time t and the actions made under this state respectively. In order to further judge the quality of a certain state, we define the state value function as follows: where V π (s t ) represents the expectation of action value function to action A, namely the score of the current state s t . In this paper, a neural network v(s; w) is used to approximate the state value function. The purpose of the training value network is to make the system score the state more and more accurately, closer to the return given by the real environment.
In this paper, we use the sum of the reward given by the current action and the rating of the state at the next time as an estimate of the true reward, which is defined as follows: where y t is the estimate of U t and the target value for training the neural network. Therefore, the loss function can be expressed as: So, the gradient of the value network is: where δ t is represented as: Since v(s t ; w) is not the actual observed return, but the estimate of the real discount return, which is often prone to bias. Updating the value network parameters through bootstrapping is prone to overestimation, which is aggravated by continuing to use the already overestimated network to predict the next training values. Therefore, we use target value network v(s t ; w ) for estimation. The v(s t ; w ) and v(s t ; w) have the same structure, and the same initial value, but different updating methods. Each time y t is calculated, the v(s t ; w ) is used for estimation. The calculation formula is as follows.
Other calculations remain the same, and the following formula is used when w is updated.

Train Actor Policy Network
The policy function π(a | s) can calculate the probability distribution of the action according to the current state, which is used to make decisions according to the environmental state. This paper uses a neural network π(a | s; θ) to approximate the policy function. The goal of training strategy networks is to obtain a bigger discount on the output of actions in the face of different state inputs.
The traditional calculation method of policy gradient is: In order to improve the convergence efficiency of reinforcement learning algorithm, this paper adopts the policy gradient calculation method based on a baseline, and the gradient calculation formula becomes: where b is the parameter independent of action a t . By subtracting b from U t , the variance of the gradient is reduced to accelerate the convergence rate of the algorithm. For b, we choose the state value network v(s t ; w) and use Equation (14) to estimate U t , and we finally obtain the calculation method of policy gradient based on the baseline:

MDP Model
To use the deep reinforcement learning method to solve the resource-scheduling optimization problem proposed in Section 4, we expressed the problem as an MDP model, which mainly includes state space, action space, and reward. The specific design is as follows.
The state space should include the allocation of each processing resource at time t and the specific state of the task to be allocated at time t, such as the workload, budget, and the maximum completion time of the task that can be tolerated. Therefore, the state space is defined as: where WorkLoad i , MaxT i and MaxC i , respectively, represent the workload, maximum completion time that can be tolerated, and the maximum budget cost of the i-th task. The action space represents the corresponding actions that the agent can perform in the state s t . The system will remove the task with the highest priority from the wait queue and allocate resources to it. The actions are defined as follows: where γ i ∈ {0, 1} indicates how the task is allocated on the i-th resource: 0 for unallocated and 1 for allocated. The reward is the score that the agent obtains after taking an action based on the current state of the environment to evaluate the goodness of the action. The system calculates the reward of the action according to whether the requirements of tasks on service quality are satisfied after resource allocation and the completion time of the system. With the maximum completion time of the system as the standard of reward, if the action of resource allocation causes the completion time and cost of the task to be greater than the maximum that the task can endure, there will be a huge penalty. The reward can be defined as: where ε is the system parameter.

Proposed Algorithm
The specific steps of the algorithm are shown in Algorithm 1. The algorithm is trained for M epochs, and the network parameters are updated T times in each epoch. At the beginning of each epoch, a task waiting queue is randomly initialized, and a random initial state is obtained by observing the environment. Before each parameter update, the system obtains a set of (s t , a t , r t , s t+1 ) samples from the environment and puts them into the buffer pool R. When the amount of data in R reaches the set value, N groups of samples are randomly selected from R for batch gradient update of network parameters.
Algorithm 1 A resource-scheduling algorithm based on Actor-Critic.

Input: Task queues;
Output: A resource-scheduling model; 1: Initialize the Actor network π(a | s; θ). 2: Initialize the Critic network v(s t ; w) and target value network v(s t ; w ). 3: Initialize the experience replay buffer pool R. 4: for all j = 1, 2, · · · , M do 5: Randomly initialize a task-waiting queue and observe an initial random state s t . 6: for all i = 1, 2, · · · , T do 7: Select an action a t based on the Actor network π(a t | s t ; θ).

8:
Obtain the environment state of the next moment s t+1 and calculate the corresponding reward of the action a t according to Equation 21.

9:
The multiple sets of samples (s t , a t , r t , s t+1 ) obtained by repeating the above process several times are put into the experience replay buffer pool R.

10:
If the amount of data in R reaches the set value, N batches of samples (s i , a i , r i , s i+1 ) are randomly selected from R to train the parameters in the neural network.

11:
Calculate y i according to Equation 14.

13:
Calculate the policy network gradient:

14:
Calculate the value network gradient: g w = 1 N ∑ i δ i · ∂v(s i ;w) ∂w 15: Update policy network: θ ← θ + β · g θ 16: Update value network: w ← w − α · g w 17: After iteration k times, update the target value network according to Equation 15. 18: end for 19: end for In the gradient update process, y i and δ i are calculated using Equations (13) and (14). Then, the gradient of the policy network is calculated by weighted average according to Equation (18). y i served as the target value label of the value network, and the gradient of the value network is calculated by weighted average according to Equation (12). The calculated gradient is used to update the parameters of the policy network and value network, respectively. Finally, the parameters of the target value network are updated after iteration k times.

Experimental Settings
To simulate and verify the effectiveness of the proposed deep reinforcement learning algorithm in solving resource-scheduling problems, this paper uses the Gym framework of OpenAI to model the environment involved in resource-scheduling problems, and it uses the open source framework Tianshou [34] based on PyTorch to conduct simulation experiments on the proposed algorithm. The experiment simulates a scenario in which the system allocates and schedules incoming tasks according to existing computing resources in the process of constructing the state channel in the blockchain environment. The scenario consists of 10 processing units and 100-1000 federated learning tasks with different requirements on service quality. Each unit has a computation rate of 500-2000 MIPS and an overhead of 5-20 per unit of time. The workload length of each task is 100-4000 MI, the budget is 10-50, and the maximum completion time of the task that can be tolerated is 5-10 s. The setting of specific experimental parameters is shown in Table 2.
The Actor-Critic algorithm has 100 epochs, and 5000 updates of network parameters are performed in each epoch. The buffer size is set to 2000, and the learning rate of both the Actor network and Critic network is 1 × 10 −4 . The specific algorithm parameter settings are shown in Table 3.
In this paper, several algorithms are selected to compare with the proposed algorithm (A2C). (1) In the random selection algorithm, each federated learning task is randomly allocated a block of computing resources. (2) The greedy algorithm only considers scheduling schemes that can minimize the maximum completion time of the system. (3) Genetic algorithm (GA) [21] is designed and proposed according to the biological evolution law in nature and simulates the natural selection and biological evolution process, which is a common method to solve the problem of objective optimization.

Analysis of Results
In this paper, several evaluation indicators of the algorithm are set as follows: • maxMakespan, the maximum completion time of the system, is used to evaluate the efficiency of the system, as defined in Equation (5). • avgCost, the average task cost, which evaluates the cost of a task, is defined as follows: • avgCompletedTime, the average task competition time, is defined as follow: • mSLA, the percentage of the tasks that meet the requirements on service quality in the total tasks, evaluates the ability of the algorithm to meet the requirements of tasks on service quality, which is defined as follows: mSLA = N mSLA N task (24) where N mSLA represents the proportion of tasks that meet the requirements on service quality among all tasks. This paper assumes that the indicators of the requirements of tasks on service quality are TCost and ECT; that is, the tasks that meet the two indicators are defined as the tasks that meet the quality of service requirements.
(a) data sets of 100 tasks (b) data sets of 100 tasks (c) data sets of 100 tasks  Figure 4 shows the change of the total reward of the Actor-Critic algorithm on 100, 500, and 1000 test tasks set, respectively. As can be seen from the figure, the total reward of the model obtained at the initial stage of training is low on the task set, which indicates that the model needs to be trained a certain number of times to obtain a better resource allocation scheme. In addition, it can be found that the convergence rate of the algorithm on the different number of tasks is not the same. In general, the more tasks the algorithm converges, the slower the algorithm converges. When the number of training iterations reaches 40, 50, and 70, the total reward of the algorithm on 100, 500, and 1000 test tasks set gradually becomes stable. This paper illustrates the advantages of DRL in resource scheduling by comparing it with other three heuristic and meta-heuristic resource-scheduling algorithms. Table 4 shows the performance of four different resource-scheduling algorithms on the different number of task sets. The indexes are maxMakespan (the maximum completion time of the system), avgCost (the average task cost, which evaluates the cost of a task), avgCompletedTime (the average task competition time), and mSLA (the proportion of tasks that meet the requirements on service quality among all tasks). The four algorithms are the random selection algorithm, greedy algorithm, GA (genetic algorithm), and A2C (the proposed algorithm based on deep reinforcement learning).
As for the performance of the maxMakespan index, it can be seen from the table that the maximum completion time of the system obtained by using a random scheduling algorithm on 100, 500, and 1000 task sets is the largest among the four scheduling algorithms and is proportional to the number of tasks. The GA algorithm and A2C algorithm come second, and the greedy strategy has the least. This trend also increases with the number of tasks, and the gap between different algorithms becomes more and more obvious. Figure 5 shows the histogram of different scheduling algorithms on maxMakespan, which can intuitively show the advantages and disadvantages of different algorithms on maxMakespan. For avgCost and avgCompletedTime, different from maxMakespan, the greedy algorithm performed poorly in these two indexes. The avgCost obtained by the greedy algorithm is only slightly better than that of the random selection scheduling algorithm, but the gap is not big. The performance of the other three algorithms on this index from bad to good is the random selection algorithm, GA algorithm, and A2C algorithm, respectively. Figures 6 and 7, respectively, show the histogram of performance of different scheduling algorithms at avgCost and avgCompletedTime.  As for the mSLA index, it can be seen that the random selecting scheduling algorithm on the test set can only ensure that less than 50% of the tasks meet the requirements of service quality. The greedy algorithm and random selection algorithm are not much different. The GA algorithm can ensure that about 80% of the tasks meet the quality of service requirements. The A2C algorithm can ensure that more than 90% of the tasks can meet the requirements under different task numbers, but the proportion decreases with the increase of tasks. Figure 8 shows the histogram of the performance of different resource-scheduling algorithms on mSLA. Combined with the pictures and the table, random selection scheduling has the worst effect on any index because of its randomness and uncertainty. Although the greedy algorithm has an excellent performance in the completion time of the system, it can not well meet the service quality requirements of the task, and the effect of other evaluation indicators is not good. This is because the greedy algorithm only considers the maximum completion time of the system and does not take into account other factors such as the task's quality of service requirements. The A2C algorithm and GA algorithm not only consider the maximum completion time of the system but also consider the task quality of service requirements and other constraints. The evaluation on the different number of tasks and different indexes shows that the A2C algorithm has a better effect than the traditional GA algorithm on this problem model. For different system sizes and complex environments, the A2C algorithm based on deep reinforcement learning can constantly optimize its scheduling strategy through the feedback of the environment, and after a certain degree of pre-training, it can conduct efficient and fast scheduling. Experiments show that the proposed algorithm can reduce the maximum completion time of the system based on satisfying the requirements of task quality of service, which makes the whole blockchain system more efficient in scheduling federated learning and training tasks.

Conclusions
This paper designs a federated learning multi-task scheduling mechanism based on a trusted computing sandbox. First of all, a system framework for completing federated learning tasks in the blockchain environment is proposed. The blockchain reasonably allocates resources for each federated learning task and constructs a computing sandbox as a state channel. The federated learning model training process is carried out in the channel and supervised by all nodes. Secondly, considering the factors such as the resource heterogeneity of each participating node, system efficiency, and the requirements of tasks on service quality, an optimization problem model of resource allocation and scheduling in the blockchain scenario was constructed. Finally, the problem model was constructed as an MDP model, and a resource-scheduling algorithm based on Actor-Critic was designed to solve the problem. The experimental results show that the proposed algorithm has good convergence under different task number scenarios, and it can reduce the maximum completion time of the system and improve the efficiency of the system while meeting the requirements of tasks on service quality.
For future work, the privacy and security issues behind the value sharing of a large amount of data will be paid more and more attention. Federated learning and blockchain technology will play an increasingly important role in the field of data privacy and security through the advantages of local data and the efficient and safe data-sharing method of decentralization. How to protect user data privacy through federated learning in the decentralized mode will become the focus of future research. In the next step, we will study how to realize the discovery and supervision of malicious behaviors of nodes in the computing sandbox during the training of federated learning to prevent privacy leakage and attacks in time.