Differential Optimization Federated Incremental Learning Algorithm Based on Blockchain

Abstract: Federated learning is an active research area in the field of privacy protection. However, it faces three problems: local model parameters are difficult to integrate, model timeliness is poor, and local model training raises security issues. This paper proposes a blockchain-based differential optimization federated incremental learning algorithm. First, we apply differential privacy to the weighted random forest and optimize the parameters of the weighted forest to reduce the impact of differential privacy on the accuracy of the local model. Using different ensemble algorithms to integrate the local model parameters improves the accuracy of the global model and, at the same time, reduces the risk of data leakage caused by gradient updates. Second, incremental learning is applied to the federated learning framework to improve the timeliness of the model. Finally, the model parameters from the model training phase are uploaded to the blockchain and synchronized quickly, which reduces the cost of data storage and model parameter transmission. The experimental results show that, for training on the public data set, the accuracy of the stacking ensemble model in each period is above 83.5% and the variance is lower than $10^{-4}$. Both the accuracy of the model and the security and privacy of the model are improved.


Introduction
Google proposed federated learning, a new privacy protection technology, in 2016 [1][2][3]. Because it protects privacy and keeps data local and secure, it is widely used in many fields.
Federated learning can be applied to the fields of machine learning and deep learning.
In the field of machine learning, Yang, K. et al. [4] proposed an implementation of vertical federated logistic regression under the centralized federated learning framework. However, in real life it is difficult to find a third-party coordinator that both partners trust. Therefore, Yang, S.W. et al. [5] proposed a vertical federated logistic regression method under the decentralized federated learning framework, in which both the data of each partner and the transmission channel remain confidential. Liu, Y. et al. [6] proposed federated forest, a random forest method under the centralized vertical federated learning framework. During modeling, each tree is built jointly, and its structure is stored on the central server and by every party. However, each party holds only the node information that matches its own features and cannot obtain useful information from the other data holders, which ensures data privacy. As a result, the random forest model is stored in a distributed manner: the central server retains the complete structure information, and the node information is scattered among the data holders. When the model is used, the locally stored node information is used first, and the central server then coordinates the distributed nodes. This method reduces the communication frequency of each tree during prediction, which improves communication efficiency. Li, Q.B. et al. [7] proposed similarity-based federated learning, a decentralized horizontal federated learning framework for multi-party GBDT modeling, which improves communication efficiency at the cost of a small amount of privacy protection. Hartmann, V. et al. [8] put forward a method of deploying support vector machines in federated learning, which protects data privacy mainly through feature hashing and update blocks.
In the field of deep learning, Zhu, X.H. et al. [9][10][11] used a simple CNN to test the existing federated learning framework and the impact of the number of data sets and clients on the federated model. Li, T. et al. [12] proposed FedProx, a federated learning framework that addresses statistical heterogeneity, and used it to train LSTM classifiers on federated data sets, mainly for sentiment analysis and character prediction.
The training data of federated learning come from different data sources, so the distribution and quantity of the training data affect the federated model. If the training data distributions of the data sources differ, it becomes difficult to integrate the local models of multiple parties [13,14]. A logistic regression model was used as the initial global model to train on the data of all data sources, and a neural network was used to integrate the local models. However, because the loss function of the neural network model is non-convex, it was difficult to reach the optimum of the loss function after averaging the parameters. To address this problem, the federated average algorithm (FedAvg) was proposed, which integrates the local models of multiple parties by averaging their weights or gradients to obtain the global model [15][16][17][18]. However, the deep leakage from gradients algorithm was later proposed against the federated average algorithm; it can recover most of the training data from the gradient updates of the local model [19,20]. Moreover, the above references do not consider the timeliness of federated models or the security of data and models when local models are trained.
Based on the above problems, this paper puts forward a differential optimization federated incremental learning algorithm based on blockchain. First, the exponential mechanism and the Laplace mechanism are used in the training process of local models to enhance the security of local models and data, although this causes a loss of accuracy in the local models. Second, the parameters and weights of the decision trees are optimized to improve the accuracy of the local random forests, which alleviates the accuracy loss caused by adding differential privacy to the local models. Third, a stacking ensemble algorithm is used to integrate multiple local model parameters, which reduces the risk of information leakage caused by updating local parameters and further improves the accuracy of the model. Fourth, the initial global model parameters, the local model parameters, and the updated global models are uploaded to the blockchain in each period and quickly synchronized. To improve the efficiency of the blockchain, the consensus algorithm is improved, and a consensus mechanism based on the quality of training parameters (proof of quality, PoQ) [21] is proposed. From the data storage point of view, storing the model parameters in the blockchain improves their security and reliability. Experimental results show that the accuracy of the algorithm is higher than that of the federated average algorithm [22], and that the security of data and models in the training phase is improved. In addition, compared with the general federated learning model, the algorithm automatically uploads the parameters and results of each period and iteration to the corresponding data blocks and synchronizes them quickly, greatly reducing the cost of transmitting the model parameters to the blockchain. Finally, because blockchain data cannot be tampered with or deleted, the model parameters stored on the blockchain are protected.
The main contributions of this paper are as follows:
1. This paper proposes a new federated learning algorithm: the differential optimization federated incremental learning algorithm based on blockchain.
2. The method is evaluated experimentally on stream data, which verifies the effect of the algorithm on stream data.
3. Considering the risk caused by gradient leakage, the algorithm proposed in this paper applies differential privacy. Gaussian noise is added to the data during model training, and Laplace noise is added to the output of the local model.
4. The experiment is conducted on an unbalanced data set, taking into account the balance between model accuracy and privacy.
The rest of the paper is organized as follows. Section 2 describes the algorithm flow and performance analysis. Section 3 describes the experimental environment, the source of the data set, the division of the data set to build multi-source stream data, and the specific experimental settings. Finally, Section 4 presents the conclusions.

Description of the Algorithm
The blockchain-based differential optimization federated incremental learning algorithm applies differential privacy and ensemble learning to the framework of federated learning. The algorithm includes three stages: model transmission, model training, and model storage. In the model transmission stage, an asymmetric encryption algorithm with a 512-byte (4096-bit) key is used to ensure the security of the model transmission process. In the model training stage, differential privacy technology is used to improve the security of the data and the model, and parameter optimization and ensemble algorithms are used to improve the accuracy of the model. In the model storage stage, the blockchain is used to store the parameters of each model in each period, which greatly reduces the cost of data transmission and guarantees the security of the data.

Model Transmission Stage
The algorithm in the model transmission stage is as follows. First, each data source uses the RSA encryption algorithm to generate a 512-byte (4096-bit) key pair, and a trusted third party uses the public key to encrypt the initial global model and transmits it to each data source. Each data source decrypts the model with its private key before training, which ensures the security of the model during transmission. Then, each data source uses its private key to encrypt the local model parameters and transmits them to the trusted third party. After decrypting them with the public key, the trusted third party uses the ensemble algorithm to integrate the local model parameters, which ensures the security of the local model parameters during transmission (see the purple part in Figure 1).

Model Training Stage
The algorithm in the model training stage is as follows. Each data source divides the incremental data into three parts: a pre-training set, a pre-test set, and a test set. The initial global model obtained after decryption with the private key is trained on the pre-training set and tested on the pre-test set, and the score obtained is used as the weight of each base classifier in the initial global model. Differential privacy is then added and the parameters of the model are optimized to establish a local model that satisfies privacy protection. The local model is evaluated on the test set, and the resulting score is used as the local model score of the data source. A trusted third party uses a stacking ensemble algorithm and an averaging method to integrate multiple local models to obtain an updated global model for each period, and iterative training continues (see the green part in Figure 1).
Figure 1. Blockchain-based differential optimization joint incremental learning algorithm framework.
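The three-way division of the incremental data described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_incremental_data` and the split fractions are hypothetical, since the paper does not specify the proportions used.

```python
import random


def split_incremental_data(samples, pretrain_frac=0.5, pretest_frac=0.25, seed=0):
    """Shuffle one period's incremental data and split it into a
    pre-training set, a pre-test set, and a test set.

    The fractions are illustrative assumptions, not values from the paper.
    """
    if pretrain_frac <= 0 or pretest_frac <= 0 or pretrain_frac + pretest_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")

    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_pretrain = int(len(shuffled) * pretrain_frac)
    n_pretest = int(len(shuffled) * pretest_frac)

    pre_train = shuffled[:n_pretrain]
    pre_test = shuffled[n_pretrain:n_pretrain + n_pretest]
    test = shuffled[n_pretrain + n_pretest:]  # the remainder is the test set
    return pre_train, pre_test, test
```

In this sketch, the pre-test set would supply the scores used as base classifier weights, and the test set would supply the local model score of the data source.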

Model Storage Stage
The algorithm in the model storage stage is as follows. The initial global model parameters in each period are encrypted with a private key and uploaded to the corresponding block. The verification nodes on the blockchain use a consensus mechanism based on the proof of training quality. If 2/3 of the nodes agree that the initial global model parameters in this period are the same as the updated global model parameters in the previous period, the parameters are stored in data block 1 of the block generated in this period.
The local model parameters in each period are likewise encrypted with a private key and uploaded to the corresponding block. The nodes on the blockchain use the consensus mechanism based on the quality of the training parameters to verify the accuracy of the model. If the accuracy of the local model trained during this period does not reach the lowest accuracy determined by 2/3 of the nodes, the data source must further optimize the model parameters to improve the accuracy of the local model. Once the accuracy of the local model meets the requirement, the local model parameters can be stored in data blocks 2 to n − 1.
The global model updated in each period is also encrypted with a private key and uploaded to the corresponding block, and the nodes on the blockchain verify it with the consensus mechanism based on the quality of training parameters. If 2/3 of the nodes consider that the updated global model parameters in this period are comparable with the updated global model in the previous period, and the accuracy fluctuates within an acceptable range, then the global model parameters updated in this period are stored in data block n of the block (see the blue part in Figure 1).
The overall framework of the algorithm is shown in Figure 1.

Algorithm Flow
The algorithm flow is as follows.
Model transfer phase
Step 1: Each data source uses the RSA encryption algorithm to generate a 4096-bit public key and private key, and transmits the public key to a trusted third party.
Step 2: The trusted third party uses the public key to encrypt the initial global model and transmits it to each data source.
Step 3: After each data source obtains its local model, it uses the private key to encrypt the local model and transmits it to the trusted third party. The trusted third party uses the public key to decrypt and integrate the multiple local models.
Model training phase
Step 1: Each data source uses the private key to decrypt and obtain the initial global model, a random forest.
Step 2: Each data source determines the initial parameters of the algorithm: the number of decision trees L, the number of pre-test samples X, and the pre-pruning parameter ε.
Step 3: Each data source draws L training sets $D_t$ ($t = 1, 2, \ldots, L$) from the incremental data with replacement, and divides the samples in each training set $D_t$ into training data and test data.
Step 4: Each data source selects a certain percentage of the training data as pre-training samples, and the remaining samples in the training data are used as pre-test samples.
Step 5: Evenly distribute the privacy protection budget B over the trees, $\varepsilon_{\mathrm{tree}} = B/L$, then over the layers of each tree, $\varepsilon_{\mathrm{layer}} = \varepsilon_{\mathrm{tree}}/(d+1)$, and divide the privacy protection budget of each node into two equal parts, $\varepsilon_{\mathrm{node}} = \varepsilon_{\mathrm{layer}}/2$.
Step 6: Randomly select m features from the training sample. If the m features contain n continuous features, assign one share of the privacy protection budget of the node to each continuous feature and one share to the discrete features, $\varepsilon_{\mathrm{feat}} = \varepsilon_{\mathrm{node}}/(n+1)$. For each continuous feature, calculate the corresponding Gini index, and then select the best continuous feature.
Step 7: Compare the Gini index of the best continuous feature with the Gini index of each discrete feature, and select the split feature and split point with the smallest Gini index among the m randomly selected features. According to this feature and the best split point, divide the current node into two child nodes, and repeat Steps 6 and 7 for each child node.
Step 8: If the node reaches the stopping condition, set the current node as a leaf node and use the Laplace mechanism to add noise when classifying the current node. Otherwise, set the current node as an internal node, count the samples of each child node, and use the Laplace mechanism to add noise, N = NoisyCount(number of child node samples). In this way, a decision tree that satisfies ε-differential privacy protection is established.
Step 9: Use the classification accuracy of the L decision trees on the pre-test samples as the weights of the L decision trees to form a random forest that satisfies ε-differential privacy.
Step 10: Iteratively optimize the parameters set in Step 2, select the final optimized parameters, and generate an optimized random forest model that satisfies ε-differential privacy.
Step 11: A trusted third party integrates the multiple local models that satisfy differential privacy using a stacking ensemble algorithm and an averaging method to obtain an updated global model.
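The non-private core of Steps 6, 7, and 9 of the training phase (Gini-based split scoring and accuracy-weighted voting) can be illustrated with the following Python sketch. It deliberately omits the differential privacy noise, and the function names are illustrative assumptions rather than part of the original implementation.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_of_split(left_labels, right_labels):
    """Size-weighted Gini index of a binary split (smaller is better)."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    return (len(left_labels) / n) * gini_impurity(left_labels) + \
           (len(right_labels) / n) * gini_impurity(right_labels)


def weighted_vote(tree_predictions, tree_weights):
    """Combine per-tree class predictions, weighting each tree by its
    pre-test accuracy (Step 9), and return the class with the largest
    total weight."""
    totals = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

In Step 7, the candidate split with the smallest value of `gini_of_split` would be chosen; in a differentially private tree, this selection would additionally be randomized as described in the steps above.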
Model storage stage
Step 1: In each period, a trusted third party uses the ECC encryption algorithm to generate a 521-bit public key and private key, transmits the generated private key to each data source while keeping a private key itself, and transmits the public key to the corresponding generated block i.
Step 2: Each data source uses the private key to encrypt the initial global model and uploads it to the corresponding generated block i, where it is decrypted with the public key. The nodes on the blockchain adopt a consensus mechanism based on the proof of training quality (PoQ) for verification. If 2/3 of the nodes agree that the initial global model parameters in this period are the same as the updated global model parameters in the previous period, that is, $w_{h_i} = w_{h_{i-1}}$, then the initial global model parameters in the period are stored in data block 1 of the generated block.
Step 3: Each data source uses the private key to encrypt the local model and uploads it to the corresponding block i, where it is decrypted with the public key. The nodes on the blockchain adopt the consensus mechanism based on PoQ for verification. If the accuracy of the local model trained in this period does not reach the minimum accuracy determined by 2/3 of the nodes, the data source must further optimize the local model until its accuracy meets the requirement, that is, until it satisfies $\mathrm{score}_{\mathrm{local\_model}} \geq \alpha$ (α is the lowest accuracy rate recognized by 2/3 of the nodes). The local model parameters can then be stored in data blocks 2 to n − 1.
Step 4: The global model updated in each period is encrypted with a private key and uploaded to the corresponding block, where it is decrypted with the public key, and the nodes on the blockchain use the consensus mechanism based on the quality of training parameters for verification. If 2/3 of the nodes consider that the updated global model parameters in this period are comparable with the updated global model in the previous period, and the accuracy rate fluctuates within an acceptable range, namely $|\mathrm{score}_{h_i} - \mathrm{score}_{h_{i-1}}| \leq \beta$ (β is the fluctuation range accepted by 2/3 of the nodes), then the global model parameters updated in this period can be stored in data block n of this block.
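The three acceptance checks of the storage stage can be sketched in Python as follows. This is a hedged illustration only: the function names are hypothetical, "2/3 nodes" is assumed to mean at least two thirds of the verification nodes, and the fluctuation check is assumed to use the absolute difference of the two scores.

```python
from fractions import Fraction


def two_thirds_agree(votes):
    """Return True if at least 2/3 of the verification nodes vote True.

    Assumption: '2/3 nodes' is read as 'at least two thirds'.
    """
    votes = list(votes)
    if not votes:
        return False
    approvals = sum(1 for vote in votes if vote)
    return Fraction(approvals, len(votes)) >= Fraction(2, 3)


def accept_initial_global(w_initial, w_previous_updated):
    """Step 2: the initial global parameters of this period must equal
    the updated global parameters of the previous period."""
    return w_initial == w_previous_updated


def accept_local_model(score_local, alpha):
    """Step 3: the local model score must reach the minimum accuracy alpha."""
    return score_local >= alpha


def accept_updated_global(score_now, score_previous, beta):
    """Step 4: the accuracy may fluctuate by at most beta between periods.

    Assumption: the fluctuation is measured as an absolute difference.
    """
    return abs(score_now - score_previous) <= beta
```

In such a sketch, each verification node would evaluate the relevant check and cast a vote, and the parameters would be written to the data block only when `two_thirds_agree` returns True.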
The schematic diagram of the storage part is shown in Figure 2.

Performance Analysis
The blockchain-based differential optimization federated incremental learning algorithm proposed in this paper is more accurate than the federated average algorithm, and the security of the model is also improved.

Complexity Analysis of the Algorithm
The complexity of the differential optimization federated incremental learning algorithm consists of the complexity of the RSA encryption algorithm, the complexity of the ECC digital signature algorithm, the complexity of model transmission, the complexity of model update, and the complexity of model storage; the total time complexity is the sum of these terms. In the complexity expression, n is the number of samples, d is the number of features, k is the number, N is the complexity of the encryption algorithm, w is the complexity of the initial global model transmission, G is the complexity of the local model transmission, and l is the number of rounds. Because the algorithm uses federated learning together with the stacking ensemble algorithm, its time complexity is inevitably higher than that of the federated average algorithm.

Security Analysis of the Algorithm
The algorithm uses the federated learning framework and the idea of ensemble learning. In the model training stage, the algorithm is implemented under the framework of federated learning, and the data of each data source are always stored locally, eliminating the risk caused by data transmission, thereby improving the security of the model and data. In the model storage stage, the algorithm uses ECC to generate a key pair by a trusted third party and uses a private key to sign the initial global model parameters, local model parameters, and updated global model parameters in each period and transmit them to the block i. The block i is verified using the public key and sequentially stored in the data block, which can ensure the security of the model storage process.

Timeliness Analysis of the Algorithm
The algorithm adopts the idea of incremental learning. At the data level, the data generated by each data source in each period are used as the training data for that period, which ensures timeliness at the data level. At the model level, the global model updated in one period is used as the initial global model for the next period, from which the local models and the updated global model for the next period are obtained, which ensures timeliness at the model level.
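The period-by-period update described above can be illustrated with a small Python sketch. The function `run_incremental_periods` and the toy training and aggregation functions are hypothetical stand-ins: here a "model" is just a number, whereas in the paper the local models are differentially private random forests and the aggregation is a stacking ensemble.

```python
from statistics import mean


def run_incremental_periods(initial_global, period_data, train_local, aggregate):
    """Run federated incremental learning over successive periods.

    period_data is a list with one entry per period; each entry is a list
    holding the new data of every participating data source.
    Returns the updated global model of each period.
    """
    history = []
    global_model = initial_global
    for sources in period_data:
        # Each data source trains on its own data from this period only.
        local_models = [train_local(global_model, data) for data in sources]
        # The updated global model becomes the initial model of the next period.
        global_model = aggregate(local_models)
        history.append(global_model)
    return history


def toy_train_local(global_model, data):
    """Toy stand-in for local training: move halfway toward the data mean."""
    return (global_model + mean(data)) / 2


def toy_aggregate(local_models):
    """Toy stand-in for the ensemble step: average the local models."""
    return mean(local_models)
```

The number of data sources may differ between periods, as in the experiment where one data source is removed in a later period.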

Privacy Analysis of the Algorithm
The algorithm proposed in this paper first distributes the given privacy protection budget B equally among the T trees in the random forest, $\varepsilon_{\mathrm{tree}} = B/T$. Since the samples of each tree are drawn at random with replacement, the training sets of different trees overlap to some extent. According to the sequential composition property of differential privacy, the consumed privacy protection budget is therefore the sum of the privacy protection budgets consumed by the individual decision trees. The privacy protection budget $\varepsilon_{\mathrm{tree}}$ [23] of each tree is allocated equally to the nodes along the $d + 1$ layers of the tree, with the share of each node split in half, namely $\varepsilon_{\mathrm{node}} = \varepsilon_{\mathrm{tree}}/(2(d+1))$. If the node is a leaf node, the other half of the privacy protection budget is used with the Laplace mechanism to add noise to the count values that determine the category of the leaf node; if the node is an intermediate node, the other half of the privacy protection budget is used with the exponential mechanism and the Laplace mechanism to select the best split feature and the best split point [14]. The total privacy protection budget of each data source is not greater than B, so, according to the sequential composition property of differential privacy [15,16], this algorithm satisfies ε-differential privacy protection.
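The budget allocation and the Laplace-noised count can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the number of layers is passed directly (d + 1 in the notation of the text), and a counting query is assumed to have sensitivity 1.

```python
import random


def allocate_budget(total_budget, num_trees, num_layers):
    """Split the total budget B evenly over trees, then over the layers
    of a tree, then halve the share of each node.

    Returns (per-tree budget, per-layer budget, per-node half budget).
    """
    eps_tree = total_budget / num_trees
    eps_layer = eps_tree / num_layers
    eps_half = eps_layer / 2
    return eps_tree, eps_layer, eps_half


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two independent
    exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query.

    Assumption: the query has sensitivity 1, so the noise scale is 1/epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

For example, `allocate_budget(1.0, 10, 5)` assigns 0.1 to each tree, 0.02 to each layer, and 0.01 to each half of a node's share. Summing the per-tree budgets over all trees recovers B, which is the sequential composition argument used in the analysis.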

Experimental Parameter Setting
The algorithm is implemented in Python using the PyCharm integrated development environment. The experimental hardware environment is an Intel(R) Core i5-4200M CPU 2.50 GHz processor with 8 GB of memory; the operating system is Windows 10. The experimental data set was downloaded from https://www.heywhale.com/mw/dataset/5e61c03ab8dfce002d80191d/file (accessed on 1 June 2022) (Supplementary Material) and is 15.6 MB in size. It comes from an actual data competition and therefore has practical significance. Details of the data set are provided in the Supplementary Material.

Analysis of Experimental Data
The data set is randomly divided into 12 parts to represent the data generated by different data sources in different periods. The random division is repeated 20 times to reflect the randomness and rationality of the data, as well as the relationship between the data before and after each division. The randomly divided data set provides different samples for the same data source and supports cross-validation of the model.
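The repeated random division could be sketched in Python as follows. The function names are illustrative assumptions, and using the repetition index as the random seed is a choice made here only for reproducibility; the paper does not state how the randomness was controlled.

```python
import random


def random_partition(num_samples, num_parts, seed):
    """Randomly divide sample indices 0..num_samples-1 into num_parts
    disjoint parts whose sizes differ by at most one."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing spreads the shuffled indices evenly over the parts.
    return [indices[k::num_parts] for k in range(num_parts)]


def repeated_partitions(num_samples, num_parts=12, repeats=20):
    """Repeat the random division, using the repetition index as the seed."""
    return [random_partition(num_samples, num_parts, seed) for seed in range(repeats)]
```

Each of the 12 parts would then stand for the data produced by one data source in one period.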

Analysis of Experimental Model
The experiment in this article is divided into three parts: model distribution, model training, and model storage. In the first part, a random forest is used as the initial global model; it is encrypted with the public key generated by the RSA encryption algorithm and transmitted to each data source, and each data source decrypts it with its private key to obtain the initial global model. In the second part, each data source trains the acquired initial global model on its incremental data, optimizes the number of trees L, the number of pre-test samples X, and the pre-pruning parameter l, and obtains the local model for the period. It then uses the private key generated by the RSA encryption algorithm to encrypt the local model and transmits it to a trusted third party. The trusted third party decrypts with the public key and integrates the local models with the stacking ensemble algorithm to obtain the most effective updated global model for the period. In the third part, a trusted third party uses the ECC digital signature algorithm to generate a key pair, transmits the private key to each data source while retaining a private key itself, and transmits the public key to the corresponding block. Each data source and the trusted third party use the private key to sign the initial global model parameters, local model parameters, and updated global model parameters in each period and transmit them to the corresponding block. The block verifies them with the public key and stores them in the corresponding data block.
In the model distribution stage, to ensure that each initial global model $H_i$ can be safely transmitted to each data source, the initial global model is encrypted with a 4096-bit public key for transmission.
In the model training phase, each data source uses the private key to decrypt and obtain the initial global model, a random forest. It then trains the random forest on its own data and optimizes the number of trees L, the number of pre-test samples X, and the pre-pruning parameter l, which yields the model with the optimal parameters in the $t_1$ period. Detailed parameter optimization is shown in the Supplementary Material.
The accuracy of the random forest in the $t_1$ period is expressed as the average of the training accuracy of the random forest on the incremental data generated by each data source in the $t_1$ period, which reflects the accuracy on the data of every source. Three local models $h_{in}$ are generated each time. The mean and variance are used to measure the performance of the local models. Table 1 shows the performance of the initial global model $H_0$ on each data source.
Here, k = 1, 2, 3 indicates the data source. To show the differences between the variances more clearly, the original variances are magnified by $10^5$. In the following tables, k likewise indicates the data source, and all variances are magnified by $10^5$.
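The summary statistics used in the tables (mean accuracy and variance magnified by $10^5$) could be computed as in the following Python sketch. The function name is hypothetical, and the use of the population variance is an assumption, since the paper does not state which variance estimator was used.

```python
from statistics import mean, pvariance


def summarize_accuracies(accuracies, scale=1e5):
    """Return the mean accuracy and the variance magnified by `scale`.

    Assumption: the population variance is used; the sample variance
    (statistics.variance) would be an equally plausible reading.
    """
    return mean(accuracies), pvariance(accuracies) * scale
```

For three hypothetical local model accuracies of 0.75, 0.76, and 0.77, this gives a mean of 0.76 and a magnified variance of about 6.67, which corresponds to an original variance below $10^{-4}$.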
Table 1 shows that the accuracy of the initial global model $H_0$ trained on the data generated in the $t_1$ period is above 74.5% and that the variance is very small, indicating that the multiple local models perform well. Weighting is slightly better than not weighting, and the model without differential privacy is significantly better than the differentially private models with different privacy protection budgets. When the privacy protection budget is 0.25, the accuracy of the model is slightly better than with budgets of 0.5 and 0.75, indicating that the accuracy of the model is also improved while privacy protection is guaranteed.
Table 2 shows that, when no differential privacy is added, the accuracy of the weighted model is slightly better than that of the unweighted model. The variance of the weighted model is greater than that of the unweighted model, but the variances in Table 2 are all less than $10^{-4}$. For the model with differential privacy, the accuracy is lowest when the privacy protection budget is 0.25, followed by budgets of 0.5 and 0.75. This shows that the smaller the privacy protection budget, the lower the availability of the model but the higher the privacy of the data and the model. Compared with the federated average algorithm, the accuracy of the updated global model obtained by the stacking ensemble algorithm is about 5% higher and its variance is smaller, indicating that the stability and generalization ability of the model are better than those of the federated average algorithm.
To test the training results of the global model $h_1$ updated in the $t_1$ period, $h_1$ is used as the initial global model $H_1$ in the $t_2$ period, and training is performed on the data generated in the $t_2$ period. The initial global model $H_1$ is trained on each data source, and the number of trees L, the number of pre-test samples X, and the pre-pruning parameter l are optimized to obtain the model with the optimal parameters in the $t_2$ period. Detailed parameter optimization is shown in the Supplementary Material.
The accuracy of the random forest in the $t_2$ period is expressed as the average of the training accuracy of the random forest on the incremental data generated by each data source in the $t_2$ period, which reflects the accuracy on the data of every source. Three local models are generated each time. The mean and variance are used to measure the performance of the local models. Tables 3 and 4 show the performance in the $t_2$ period when the initial global model $H_1$ is the updated global model obtained in the $t_1$ period by the stacking ensemble and by the federated average algorithm, respectively. Table 3 shows that the accuracy of the initial global model $H_1$ trained on the data generated in the $t_2$ period is above 83.5% and that the variance is very small, indicating that the multiple local models perform well. The weighted and unweighted accuracy rates are almost equal because the model trained on the data generated in the $t_2$ period has converged. In contrast to the $t_1$ period, the accuracy of the differentially private models with different privacy protection budgets is the same as that of the model without differential privacy.
Table 4 shows that, when the initial global model $H_1$ is the updated global model of the federated average algorithm, the accuracy of the models trained on the data generated in the $t_2$ period is above 80% in every case and the variance is very small, indicating that the local models perform well. Compared with the $t_1$ period, the accuracy of the differentially private models at the different privacy protection budgets increased by about 5%, indicating that the model generalizes well. Table 5 shows that, when differential privacy is not added, the accuracy of the weighted model is almost equal to that of the unweighted model. The variance of the weighted model is greater than that of the unweighted model, but all variances in the table are less than $10^{-4}$. For the models with differential privacy, the accuracy is lowest at a privacy protection budget of 0.25, followed by budgets of 0.5 and 0.75. This shows that the smaller the privacy protection budget, the lower the availability of the model but the higher the privacy of the data and the model. Compared with the federated average algorithm, the updated global model obtained by the stacking ensemble algorithm is about 2% more accurate and has a smaller variance, indicating that its stability and generalization ability are better than those of the federated average algorithm.
To test the influence of adding or removing data sources on the algorithm, one data source is removed in the $t_3$ period, leaving two data sources for training. The global model $h_2$ updated in the $t_2$ period is used as the initial global model $H_2$ in the $t_3$ period and trained on the data generated in the $t_3$ period. The initial global model is trained on each data source, and the number of trees $L$, the number of pre-test samples $X$, and the pre-pruning parameter $l$ are optimized to obtain the model with the optimal parameters for the $t_3$ period. Detailed parameter optimization is shown in the Supplementary Material.
The accuracy of the random forest in the $t_3$ period is expressed as the average of the training accuracy of the random forest on the incremental data generated by each data source in the $t_3$ period, so that the accuracy on every data source is taken into account. Two local models $h_{in}$ are generated each time, and their performance is measured by the mean and variance of their accuracy. Tables 6 and 7 report the performance in the $t_3$ period when the initial global model $H_2$ is the updated global model obtained in the $t_2$ period by the stacking ensemble algorithm and by the federated average algorithm, respectively. Table 6 shows that the accuracy of the initial global model $H_2$ trained on the data generated in the $t_3$ period is more than 83.5% and that the variance is very small, which shows that the local models perform well and are stable. Compared with the $t_2$ period, the accuracy of the model is improved, which indicates that the initial global model $H_2$ has a strong generalization ability.
Table 7 shows that the accuracy of the initial global model $H_2$ trained on the data generated in the $t_3$ period is more than 83.5% and that the variance is very small, which shows that the local models perform well and are stable. Compared with the $t_2$ period, the accuracy of the model is improved by more than 3%, which indicates that the initial global model $H_2$ has a strong generalization ability. Table 8 shows that, without differential privacy, the accuracy of the weighted model is almost equal to that of the unweighted model, and the variance of the weighted model is greater than that of the unweighted model, although all variances in the table are less than $10^{-4}$. For the models with differential privacy, the accuracy is almost equal at privacy protection budgets of 0.25, 0.5, and 0.75. Compared with the federated average algorithm, the accuracy of the updated global model obtained by the stacking ensemble algorithm is almost equal, but the security of the model is improved.
To further test the influence of adding or removing data sources on the algorithm, one data source (four data sources) was added for training in the $t_4$ period, and the global model $h_3$ updated during the $t_3$ period was used as the initial global model $H_3$ in the $t_4$ period and trained on the data generated in the $t_4$ period. The initial global model $H_3$ is trained on each data source, and the number of trees $L$, the number of pre-test samples $X$, and the pre-pruning parameter $l$ are optimized to obtain the model with the optimal parameters for the $t_4$ period. Detailed parameter optimization is shown in the Supplementary Material.
The accuracy of the random forest in the $t_4$ period is expressed as the average training accuracy over 20 iterations on the incremental data generated by each data source, so that the accuracy on every data source is taken into account. Four local models $h_{in}$ are generated each time, and their performance is measured by the mean and variance of their accuracy. Tables 9 and 10 show the performance in the $t_4$ period when the initial global model $H_3$ is the updated global model obtained in the $t_3$ period by the stacking ensemble algorithm and by the federated average algorithm, respectively. Table 9 shows that, when the initial global model is the updated global model of the stacking ensemble, the accuracy of most of the models trained on the data generated in the $t_4$ period is more than 84% and the variance is very small, indicating that the local models perform well and are stable.
Table 10 shows that, when the initial global model $H_3$ is the updated global model of the federated average algorithm, the accuracy of most of the models trained on the data generated in the $t_4$ period is more than 84% and the variance is very small, indicating that the local models perform well and are stable.
Table 8 shows that, when the initial global model $H_3$ is the updated global model of the federated average algorithm, the accuracy of the models trained on the data generated in the $t_4$ period is above 80% in every case and the variance is very small, indicating that the local models perform well. Compared with the $t_3$ period, the accuracy of the differentially private models at the different privacy protection budgets increased by about 5%, indicating that the model generalizes well. Table 11 shows that the accuracy of the weighted model is greater than that of the unweighted model, and the variance of the weighted model is greater than that of the unweighted model, but all variances in the table are less than $10^{-4}$, which meets the experimental requirements. For the models with differential privacy, the accuracy is almost equal at privacy protection budgets of 0.25, 0.5, and 0.75. Compared with the federated average algorithm, the accuracy of the updated global model obtained by the stacking ensemble algorithm is almost the same, but the security of the model is improved.
In the model storage phase, the trusted third party uses the ECC encryption algorithm to generate a key pair with a length of 512. The public key is broadcast to the corresponding block, the private key is transmitted to each data source separately, and one copy is retained by the trusted third party.
The Raybaas platform can quickly and efficiently build blockchain-based services and applications. The hardware is an Intel(R) Core i5-4200M CPU at 2.50 GHz, and the underlying blockchain is deployed on a CentOS 7.6 operating system. The stored data include the initial global model parameters, the local model parameters, and the updated global model parameters in the $t_1$ period. To store the initial global model parameters, the trusted third party encrypts the initial global model of the $t_i$ period with the private key retained for that period and transfers it to block $i$. The verification nodes in the blockchain decrypt it with the corresponding public key and verify it through the consensus mechanism based on training parameter quality. If 2/3 of the verification nodes consider the initial global model parameters of the period to be the same as the updated global model parameters of the previous period, that is, $w_{H_i} = w_{h_{i-1}}$, the initial global model parameters $w_{H_i}$ of the period are stored in data block 1 of the generated block $i$. To store the local model parameters, each data source encrypts its local model of the $t_i$ period with its private key for that period and uploads it to the corresponding block $i$. The verification nodes in the blockchain decrypt it with the corresponding public key and verify it through the consensus mechanism based on training parameter quality.
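The 2/3 agreement check on $w_{H_i} = w_{h_{i-1}}$ can be sketched as follows. This is a minimal illustration, not the platform's implementation: comparing SHA-256 digests of JSON-serialized parameters, reading "2/3" as "at least 2/3", and all function and variable names are assumptions introduced here.

```python
import hashlib
import json
from typing import Any, Iterable, List


def digest(params: Any) -> str:
    """SHA-256 digest of parameters serialized in a canonical JSON form."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def two_thirds_agree(votes: Iterable[bool]) -> bool:
    """True if at least 2/3 of the verification nodes vote in favour."""
    ballots: List[bool] = list(votes)
    return len(ballots) > 0 and 3 * sum(ballots) >= 2 * len(ballots)


def verify_initial_model(w_H_i: Any, node_views_of_w_h_prev: Iterable[Any]) -> bool:
    """Check w_{H_i} against each node's copy of w_{h_{i-1}}."""
    target = digest(w_H_i)
    return two_thirds_agree(digest(view) == target for view in node_views_of_w_h_prev)


# Hypothetical parameters; one node holds a tampered copy.
example_params = {"n_trees": 50, "tree_weights": [0.2, 0.3, 0.5]}
tampered_params = {"n_trees": 50, "tree_weights": [0.2, 0.3, 0.6]}

accepted = verify_initial_model(
    example_params, [example_params, example_params, tampered_params]
)
rejected_case = verify_initial_model(
    example_params, [example_params, tampered_params, tampered_params]
)
print(accepted, rejected_case)
```

Integer arithmetic is used for the threshold so that exactly two out of three nodes counts as agreement without floating-point rounding.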
If the accuracy of the local model trained in this period does not reach the minimum accuracy determined by the nodes, the data source must further optimize the local model until its accuracy meets the requirement $\mathrm{score}_{\mathrm{local\_model}} \geq \alpha$, where $\alpha$ is the minimum accuracy determined by 2/3 of the nodes. The local model parameters can then be stored in data blocks 2 to $n-1$.
To store the updated global model parameters, the trusted third party encrypts the updated global model of the $t_i$ period with the private key retained for that period and transfers it to block $i$. The verification nodes in the blockchain decrypt it with the corresponding public key and verify it through the consensus mechanism based on training parameter quality. If 2/3 of the nodes consider that the accuracy of the updated global model of this period, compared with that of the previous period, fluctuates within an acceptable range, that is, $\mathrm{score}_{h_i} - \mathrm{score}_{h_{i-1}} \leq \beta$, where $\beta$ is the fluctuation range accepted by 2/3 of the nodes, the global model parameters updated in the period can be stored in data block $n$ of the corresponding block.
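The two quality rules and the layout of data blocks inside block $i$ can be sketched as below. The inequalities are implemented exactly as written in the text, so the global rule is the one-sided difference rather than an absolute difference. The function names, the list-based block layout, and the example scores are hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def accept_local(score_local: float, alpha: float) -> bool:
    """Local rule: score_local_model >= alpha."""
    return score_local >= alpha


def accept_global(score_h_i: float, score_h_prev: float, beta: float) -> bool:
    """Global rule as written in the text: score_{h_i} - score_{h_{i-1}} <= beta."""
    return score_h_i - score_h_prev <= beta


def assemble_block(
    initial_global: Any,
    local_models: Sequence[Tuple[Any, float]],
    updated_global: Tuple[Any, float],
    alpha: float,
    beta: float,
    prev_global_score: float,
) -> List[Any]:
    """Return data blocks 1..n: initial global, local models, updated global."""
    for index, (_, score) in enumerate(local_models):
        if not accept_local(score, alpha):
            # The data source must keep optimizing before it can be stored.
            raise ValueError(f"local model {index} is below the minimum accuracy")
    global_params, global_score = updated_global
    if not accept_global(global_score, prev_global_score, beta):
        raise ValueError("updated global model fluctuates beyond beta")
    return [initial_global] + [params for params, _ in local_models] + [global_params]


# Hypothetical parameters (as labels) and scores for one period.
block = assemble_block(
    initial_global="w_H",
    local_models=[("w_a", 0.84), ("w_b", 0.85), ("w_c", 0.835)],
    updated_global=("w_h", 0.843),
    alpha=0.83,
    beta=0.01,
    prev_global_score=0.842,
)
print(block)
```

With three local models, the resulting block has $n = 5$ data blocks, so the local model parameters occupy positions 2 to $n-1$.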

Summary of the Experiment
The algorithm distributes the differentially weighted optimization random forest to each data source for training and uses the stacking ensemble algorithm to integrate the local models. The updated global model $h_1$ has an accuracy of 84.3797333%, 84.3813333%, and 84.3925333% on the incremental data in the $t_1$ period. Because the data sources of the $t_2$ period are the same as those of the $t_1$ period, the updated global model is distributed to the data sources as the initial global model of the $t_2$ period and trained, and the accuracy with the stacking ensemble is 84.3301333%, 84.2176%, and 84.2784%, respectively, indicating that the updated global model generalizes well. In the $t_3$ period, one data source is removed, the updated global model $h_2$ is distributed to the data sources and trained as the initial global model, and the accuracy with the stacking ensemble is 83.8853333%, 84.148%, and 83.8693333%. In the $t_4$ period, one data source is added, the updated global model $h_3$ is distributed to the data sources and trained as the initial global model, and the accuracy with the stacking ensemble is 83.8506667%, 83.9306667%, and 83.824%. Compared with the federated average algorithm, the accuracy of the differential weighted optimization random forest is increased by up to 5%, and by 1% on average per period. At the same time, the security of the model and the data during training is greatly improved.
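As a quick arithmetic check, the stacking ensemble accuracies reported above can be averaged per period and compared against the 83.5% level stated in the abstract. The numbers are copied from the paragraph above; the dictionary keys and variable names are hypothetical.

```python
from statistics import mean

# Stacking ensemble accuracies (in percent) reported for each period.
reported = {
    "t1": [84.3797333, 84.3813333, 84.3925333],
    "t2": [84.3301333, 84.2176, 84.2784],
    "t3": [83.8853333, 84.148, 83.8693333],
    "t4": [83.8506667, 83.9306667, 83.824],
}

period_means = {period: mean(values) for period, values in reported.items()}
lowest = min(min(values) for values in reported.values())

for period, value in period_means.items():
    print(f"{period}: mean accuracy = {value:.4f}%")
print(f"lowest reported accuracy = {lowest}%")
```

Every reported value, including the lowest one in the $t_4$ period, stays above 83.5%.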
The comparative experiment in this paper compares the stacking ensemble algorithm with FedAvg, the benchmark algorithm in federated learning, under different privacy protection budgets. The experimental results of the federated average algorithm and the stacking ensemble algorithm on stream data, under different privacy protection budgets and weighting settings, are shown in the tables.

Conclusions
This paper proposes a differential optimization federated incremental learning algorithm based on blockchain. Applying differential privacy and incremental learning to the framework of federated learning enhances the security and timeliness of data and models. Adding differential privacy affects the accuracy of the model, and the model is weighted to mitigate this effect. The initial global model parameters, local model parameters, and updated global model parameters of each period are uploaded to the blockchain. The verification nodes in the blockchain validate them through a consensus mechanism based on training parameter quality. The parameters that meet the requirements are stored in the corresponding data blocks according to the rules and quickly synchronized, which reduces the data transmission cost while guaranteeing the safety of the model parameters. When optimizing model parameters, this paper adopts a set-based idea, that is, selecting the best among the best candidates to obtain the optimal parameters. In high-dimensional cases, however, this yields only relatively good parameters rather than truly optimal ones, so optimization algorithms will be used for parameter optimization in later work. In future work, we will also try to apply this algorithm to other privacy protection technologies to further improve the security of data and models while ensuring model accuracy.
Supplementary Materials: The data set of this experiment comes from https://www.heywhale.com/mw/dataset/5e61c03ab8dfce002d80191d/file (accessed on 1 June 2022). There are 200,000 samples in this data set, of about 15.6 Mb, where caseid represents the case number, which has no practical significance, and $Q_1$ represents the information of the first question. The information is encoded into numbers, and the size of the numbers does not represent the real relationship. $Q_k$ represents the information of the $k$-th question. There are a total of 36 questions. Evaluation represents the final audit result; 0 means the claim is granted and 1 means the claim is not approved.