A Federated Incremental Learning Algorithm Based on Dual Attention Mechanism

Abstract: Federated incremental learning best suits the changing needs of common federated learning (FL) tasks. In this setting, clients with large sample sizes dramatically influence the final model, and the unbalanced features across clients are challenging to capture. In this paper, a federated incremental learning framework is designed. First, part of the data is preprocessed to obtain an initial global model. Second, to help the global model learn the feature importance of each client's overall samples and to sharpen its capture of critical feature information, a channel attention neural network model is designed on the client side, and a federated aggregation algorithm based on a feature attention mechanism is designed on the server side. Experiments on the standard CIFAR10 and CIFAR100 datasets show that the proposed algorithm achieves good accuracy while realizing incremental learning.


Introduction
Federated learning (FL) can make full use of all data, while keeping each participant's data confidential, to train a global model better than the local model each participant could train separately on its own data. Google proposed an FL algorithm for mobile terminals in 2016 [1]: each client trains a local model, and all local models are then aggregated into a better global model. The process of exchanging model information between clients is carefully designed so that clients cannot learn the private data of other clients. When the global model is obtained, it is as if the data sources had been integrated; this is the core idea of federated learning.
After Google proposed the concept of FL, H. Brendan McMahan et al. presented a practical method for federated learning of deep networks based on iterative model averaging in 2017 [2]. This algorithm trains a high-quality model in relatively few communication rounds and is the classic federated averaging algorithm. Many federated learning algorithms were subsequently developed, forming several excellent branches [3-9]. Liu, Y. et al. proposed a federated transfer learning framework that can be flexibly adapted to various multi-party secure machine learning tasks in 2018 [10]; the framework enables the target domain to build a more flexible and effective model by leveraging the many labels in the source domain. Yang, Q. et al. proposed building a data network based on a federation mechanism in 2019 [11], which allows knowledge sharing without compromising user privacy. Zhuo, H. H. et al. proposed a federated reinforcement learning framework in 2020 [12]; this framework adds Gaussian perturbation to shared information to protect data security and privacy. Peng, Y. et al. proposed a federated semi-supervised learning method for network traffic classification in 2021 [13]; it effectively protects data privacy and does not require a large amount of labeled data.
Traditional FL can only be trained in a batch setting: the classification information for all samples is known beforehand, and the model's classification ability is fixed. Faced with new tasks and data, the model has to be completely retrained. Among the many current directions of FL, research on incremental tasks is rare [14,15]. Luo [16] pointed out that traditional data processing technologies suffer from outdated models and weakened generalization and do not consider the security of multi-source data; they proposed an online federated incremental learning algorithm for blockchain in 2021. During incremental learning, client samples can be unbalanced: clients with a large sample size carry greater weight and significantly impact the final model.
Given the above discussion, we aim to handle the growth of resources dynamically without retraining, to reduce the impact of large-sample clients on the final model, and to address the feature imbalance that is elusive during model aggregation. We propose a federated incremental learning algorithm based on a dual attention mechanism. This algorithm can dynamically handle the increase in resources without retraining while mining the characteristics of each client's overall samples, and it enhances the server's ability to capture the key information of client-side features. The contributions of this paper are as follows: (1) We design a federated incremental learning framework. First, the framework randomly samples the same number of samples from each client, to ensure the balance of the pre-training samples, and trains with the federated averaging model to obtain a preliminary global model on the server. Then the iCaRL strategy [17] is applied to the traditional FL framework; this strategy classifies samples by the nearest-mean rule, selects exemplars preferentially based on herding, performs representation learning with knowledge distillation, and dynamically handles resource increases without retraining. The federated incremental learning framework can therefore accommodate dynamic changes in training tasks while keeping the data confidential.
(2) A dual attention mechanism is added to the federated incremental learning framework. A channel attention neural network model is designed on the client side and used as the FL local model. This model adds the SE module [18] to a classic convolutional neural network, which helps the model learn the feature importance of each client's overall samples during training and effectively reduces the influence of noise. On the server side, a federated aggregation algorithm based on a feature attention mechanism provides appropriate attention weights for each local model. Each weight corresponds to the model parameters of one layer of the neural network, and the attention weight is used as the aggregation coefficient, which enhances the global model's ability to capture key feature information.
This paper is organized as follows: Section 2 introduces the relevant background; Section 3 elaborates on the proposed algorithm; Section 4 presents and discusses the experimental results; Section 5 concludes the paper.

Federated Averaging Algorithm
In the classic FL setting, the federated averaging algorithm is generally used for model training; it is essentially model averaging. Each client locally performs stochastic gradient descent on the current model parameters ω_t using its local data [19] and sends the updated parameters ω_{t+1}^{(k)} to the server. The server aggregates the received parameters by a weighted average and sends the updated parameters ω_{t+1} back to each client; this method is called model averaging [20]. Finally, the server checks the model parameters, and if they have converged, it signals each participant to stop training:

ω_{t+1}^{(k)} = ω_t − η∇_k,  ω_{t+1} = Σ_{k=1}^{K} (n_k / n) ω_{t+1}^{(k)}.  (1)

In Formula (1), η is the learning rate, ∇_k is the local gradient update of the kth participant, n_k is the local data volume of the kth participant, n is the total data volume of all participants, ω_{t+1}^{(k)} are the parameters of the kth participant's local model at this point, and ω_{t+1} are the aggregated global model parameters.
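As a minimal sketch (the function names are ours, not the paper's), the local SGD step and the weighted aggregation of Formula (1) can be written as:

```python
import numpy as np

def local_update(w, grad, lr=0.2):
    """One local SGD step: w_{t+1}^(k) = w_t - eta * grad_k."""
    return w - lr * grad

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of Formula (1): w_{t+1} = sum_k (n_k / n) * w_{t+1}^(k)."""
    n = sum(client_sizes)
    return sum((n_k / n) * w for w, n_k in zip(client_weights, client_sizes))

# Toy example: two clients, where client 2 holds three times as much data,
# so its parameters contribute three times as much to the global model.
w1, w2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
w_global = fedavg_aggregate([w1, w2], [100, 300])  # → [2.5, 3.5]
```

The data-volume weighting n_k / n is exactly why large-sample clients dominate the aggregate, which motivates the attention-based aggregation introduced later.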

The Basic Structure of Federated Learning
Federated learning is an algorithm that does not directly fuse multi-party data for training; it only needs to encrypt and exchange client model parameters to train a high-performance global model. Therefore, federated learning can meet the requirements of user privacy protection and data security. Figure 1 shows an example of a federated learning architecture that includes a coordinator. In this scenario, the coordinator is an aggregation server, which sends the initial random model to each client. The clients train the model on their respective data and send the model weight updates to the aggregation server. The aggregation server then aggregates the model updates received from the clients and sends the aggregated updates back to them. Under this architecture, each client's original data always stays local, which protects user privacy and data security [21-26].

Class Incremental Learning
Class-incremental methods [27-32] learn from non-stationary distributed data streams, and they should scale to a large number of tasks without excessive computation and memory. Their goal is to use old knowledge to improve the learning of new knowledge (forward transfer) and to use new data to improve performance on previous tasks (backward transfer). During each training phase, the learner only has access to the data of one task [33]. A task consists of multiple classes, and the learner may process the training data of the current task many times during training. A typical class-incremental setup consists of a sequence of n tasks,

T = [(C^1, D^1), (C^2, D^2), ..., (C^n, D^n)],
where C is the set of classes, D is the training data, and each task t_s is represented by its set of classes and training data. Today, most classifiers for incremental learning are trained with a cross-entropy loss over all classes seen so far. The loss is calculated as:

L(x, y) = − Σ_{i=1}^{N^{t_s}} y_i log ŷ_i(x),

where x is the input feature of a training sample and y ∈ {0, 1}^{N^{t_s}} is the ground-truth label vector corresponding to x. N^{t_s} denotes the total number of classes up to task t_s, N^{t_s} = Σ_{i=1}^{t_s} |C^i|, and the data are D^{t_s} = {(x_1, y_1), (x_2, y_2), ..., (x_{m_{t_s}}, y_{m_{t_s}})}. We consider incremental learners that are deep networks parameterized by weights θ, which we split into a feature extractor f with weights ϕ and a linear classifier g with weights Z, so that h(x) = g(f(x; ϕ); Z). In this case, since softmax normalization is performed over all classes seen in all previous tasks, errors during training are backpropagated from all outputs, including outputs for classes that do not belong to the current task. Therefore, we can restrict the network outputs to the classes of the current task and define the cross-entropy loss over them alone; because this loss only considers the softmax-normalized predictions of the current task's classes, errors are only backpropagated from the probabilities associated with those classes in task t_s [34].
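A minimal sketch of this task-restricted loss (the function name and the toy logits are our own, not from the paper): the softmax is normalized only over the current task's classes, so no error signal reaches the outputs of earlier tasks.

```python
import numpy as np

def masked_cross_entropy(logits, y_true, task_classes):
    """Cross-entropy restricted to the current task's classes: softmax is
    normalized only over `task_classes`, so errors would be backpropagated
    only from those outputs."""
    z = logits[:, task_classes]
    z = z - z.max(axis=1, keepdims=True)                     # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    idx = np.array([task_classes.index(c) for c in y_true])  # global -> local labels
    return -np.log(p[np.arange(len(y_true)), idx]).mean()

logits = np.array([[2.0, 0.5, 0.1, -1.0]])  # outputs for 4 classes seen so far
# current task covers global classes 2 and 3 only; classes 0 and 1 are masked out
loss = masked_cross_entropy(logits, y_true=[2], task_classes=[2, 3])
```

Note that the large logit for class 0 has no effect on the loss, which is the point of restricting the softmax to the current task.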

Federated Incremental Learning Algorithm
In addition to data privacy protection, dealing with dynamic changes in training tasks is also essential FL research content. For example, in recommender systems, user data is updated dynamically. When a traditionally trained model faces new tasks and data, retraining on all the old and new data costs a great deal of training. Therefore, whether user data can be updated dynamically has become key to judging an algorithm; we need a federated learning algorithm that can cope with dynamic changes in training tasks while keeping the data confidential. In FL, scholars have found that large-sample clients have large weight parameters and a significant impact on the final model: sample importance is unbalanced, and feature importance is difficult to capture. This paper introduces a dual attention mechanism into the federated incremental learning framework. We hope to reduce the impact of large-sample client noise on the global model, improve the global model's ability to capture essential client features, and achieve greater business value. The ideas of this algorithm are as follows: (1) Since traditional federated learning can only be trained in a batch setting and must be retrained when facing new tasks and data, more flexible strategies are required to handle large-scale, dynamic real-world object classification. This paper proposes a federated incremental learning framework that copes with dynamic changes in training tasks while keeping the data confidential. We combine the iCaRL strategy with a federated learning framework and classify samples by the nearest-mean rule to handle resource increases dynamically without retraining. The framework therefore reduces the cost of data storage and model retraining when classes are added, and also reduces the risk of data leakage from model gradient updates.
(2) Addressing sample imbalance among clients, where a client with a large sample size greatly influences the final model: on top of the federated incremental learning framework, this paper designs a channel attention neural network model on the client side and uses it as the local model for federated learning. This model adds the SE module to a classical convolutional neural network, which helps the model learn the feature importance of each client's overall samples during training and effectively reduces the influence of noise.
(3) In traditional federated learning, the initial parameters of the global model are randomized, and these random initial parameters affect the convergence speed of the global model. A pre-training module is therefore added to the federated incremental learning framework: the same number of samples is extracted from each client as pre-training data, and the federated averaging model is trained to obtain a global model on the server, which speeds up convergence. In addition, extracting the same number of samples from each client ensures the balance of the pre-training samples and reduces the impact of large-sample clients on the global model.
(4) Federated learning aims to jointly train a high-quality global model for the clients. Still, when client data is unbalanced, the weight parameters of large-sample clients significantly impact the final model. On top of the federated incremental learning framework, this paper designs a federated aggregation algorithm based on a feature attention mechanism in the global model, providing appropriate attention weights for each local model. Each weight corresponds to the model parameters of one layer of the neural network, and the attention weight is used as the aggregation coefficient, which enhances the global model's ability to capture key feature information.

Federated Incremental Learning Framework
In addition to data privacy protection, dealing with dynamic changes in training tasks is essential FL research content. For example, in recommender systems, user data is updated dynamically. When a traditionally trained model faces new tasks and data, retraining on all the old and new data is costly. Therefore, whether user data can be updated dynamically has become key to judging an algorithm; we need a federated learning algorithm that copes with dynamic task changes while keeping the data confidential. Because incremental learning can continuously learn new concepts from a data stream [35-37], we introduce incremental learning into federated learning: each client first trains on the few classes available initially, and further classes are then added gradually for learning. Still, incremental learning suffers from forgetting of history. We therefore add the iCaRL training strategy to incremental learning: classify by the nearest-mean rule over samples, select exemplars preferentially based on herding, and perform representation learning with knowledge distillation and prototype rehearsal, so that resource increases are handled dynamically without retraining. Adding incremental learning to traditional federated learning reduces the cost of data storage and model retraining when classes are added and reduces the risk of data leakage from model gradient updates.
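For illustration only (the 2-D exemplar features below are invented), the nearest-mean-of-exemplars classification rule that iCaRL uses can be sketched as:

```python
import numpy as np

def nearest_mean_classify(feature, exemplar_features):
    """iCaRL-style nearest-mean-of-exemplars rule: assign the class whose
    L2-normalized mean exemplar feature is closest to the input feature."""
    f = feature / np.linalg.norm(feature)
    best_class, best_dist = None, np.inf
    for cls, feats in exemplar_features.items():
        mu = feats.mean(axis=0)
        mu = mu / np.linalg.norm(mu)       # normalized class prototype
        d = np.linalg.norm(f - mu)
        if d < best_dist:
            best_class, best_dist = cls, d
    return best_class

exemplars = {
    0: np.array([[1.0, 0.1], [0.9, 0.0]]),   # stored class-0 exemplar features
    1: np.array([[0.0, 1.0], [0.1, 0.9]]),   # stored class-1 exemplar features
}
pred = nearest_mean_classify(np.array([0.8, 0.2]), exemplars)  # → 0
```

Because the prototypes are recomputed from stored exemplars whenever a class is added, the classifier extends to new classes without retraining, which is what allows the framework to absorb resource increases dynamically.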
In addition, in traditional federated learning, the initial parameters of the global model are all randomly generated, which affects the convergence speed of the model. To speed up convergence, this paper adds a pre-training module. Generally, a pre-training module uses n% of all data as pre-training data. However, this approach can easily amplify the influence of clients with large sample sizes on the global model, especially the influence of non-important features [38-47] of large samples in incremental learning. We instead extract the same number of samples from each client as pre-training data, ensuring the balance of the pre-training samples and reducing the impact of large-sample clients on the global model. The federated incremental learning framework is shown in Figure 2.
Step 1: Each client uses a fixed number of samples for pre-training to obtain the global model parameters.
Step 2: The server distributes the pre-trained model to the clients.
Step 3: Each client trains the model on its latest local data.
Step 4: The server merges the received model parameters.
Step 5: The server distributes the fused model parameters to the clients.
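The five framework steps can be sketched as a single communication round; `train` and `aggregate` here are toy placeholders of our own, not the paper's actual procedures:

```python
import numpy as np

def train(model, data):
    """Toy local update: nudge the weight toward the local data mean."""
    model["w"] = model["w"] + 0.1 * (np.mean(data) - model["w"])

def aggregate(models):
    """Toy server fusion: plain average of the client weights."""
    return {"w": np.mean([m["w"] for m in models])}

def federated_incremental_round(server_model, clients):
    # Steps 2 and 5 correspond to distribution, Step 3 to local training,
    # Step 4 to server-side fusion (Step 1, pre-training, is assumed done).
    local_models = []
    for data in clients:
        local = dict(server_model)   # server distributes the global model
        train(local, data)           # client trains on its latest local data
        local_models.append(local)
    return aggregate(local_models)   # server merges the received parameters

server = {"w": 0.0}
server = federated_incremental_round(server, [np.array([1.0]), np.array([3.0])])
```

Each incremental class arrival simply triggers further rounds of this loop on the clients' newest data, rather than a full retraining from scratch.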

Dual Attention Mechanism Module
The original design concept of federated learning is to jointly train high-quality global models on the client side without revealing privacy. For example, mobile shopping malls, banks, and other apps holding high-quality customer information can recommend suitable shopping and financial products through joint model training without revealing user information, and many large enterprises have trained high-performance global models to help small and medium-sized enterprises develop. However, due to sample imbalance between clients, the noise generated by large-sample clients during training also significantly impacts the final global model. Given this problem of large samples with considerable noise, we add a channel attention mechanism on the client side: features are compressed along the spatial dimension to obtain a feature map with a global receptive field, and the relationships between channels are then learned through fully connected (FC) layers. Finally, multiplying the learned weight coefficient of each channel with all the elements of the corresponding channel helps minimize the impact of noise on the final model while capturing the characteristics of each client's overall samples. At the same time, when federated learning incorporates incremental learning [48,49] for dynamic task training, learning while classes are added incrementally leads to unbalanced sample importance and feature importance that is difficult to capture. We therefore add a feature attention mechanism [50,51] when the global model is aggregated, to enhance the capture of critical feature information. Accordingly, we add a dual attention mechanism module, combining channel attention and feature attention, to the federated learning framework. The dual attention mechanism module is shown in Figure 3. In Figure 3, the first part of the module adds a channel attention mechanism to the client, helping capture the characteristics of each client's overall samples while minimizing the impact of noise on the final model. The client's local model can take many forms, such as LSTM, CNN, and other networks combined with the SE channel attention module. The architecture of a CNN combined with channel attention is shown in Figure 4, and the specific process of the client-side neural network model is shown in Figure 5. The number of 1 × 1 convolution kernels is 16. As shown in Figure 4, the SE module is added after the first convolutional layer of the local model: layers further back have too many channels, which easily causes overfitting, and if the feature map there is too small, improper operations introduce a large proportion of non-pixel information. More importantly, close to the classification layer, the effect of attention is more sensitive to the classification results and can easily sway the decision of the classification layer. With the SE module placed at the first convolutional layer, each convolutional layer of the network has 16 convolution kernels, and each kernel corresponds to a feature channel. Channel attention can allocate resources between convolution channels: the final output is obtained by learning the degree of dependence of each channel and adjusting the different feature maps accordingly. Therefore, adding the channel attention mechanism helps focus on the overall characteristics of the samples and reduces the impact of noise on the final model. As shown in the figure, the input image size is 32 × 32, i.e., M = 32, and the number of channels C is 3; it is input to convolutional layer 1. The
convolutional layer contains 16 convolution kernels of size 1 × 1, i.e., kernel = 1, and no pixels are padded around the input image matrix, i.e., P = 0. We can think of the convolution kernel as a sliding window that slides forward with a set step size; the step size is set to 1, i.e., stride = 1. According to the output-size calculation Formula (6) for a convolutional layer, N = (M − kernel + 2P)/stride + 1, the output image size is 32 × 32, i.e., N = 32, and the number of channels C is 16.
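The output-size rule the text applies (assuming the usual convolution arithmetic, N = (M − kernel + 2P)/stride + 1) can be checked directly:

```python
def conv_output_size(M, kernel, P, stride):
    """Output side length of a square convolution: N = (M - kernel + 2P) / stride + 1."""
    return (M - kernel + 2 * P) // stride + 1

# The example from the text: 32x32 input, 1x1 kernels, no padding, stride 1
N = conv_output_size(M=32, kernel=1, P=0, stride=1)  # → 32
```

A 1 × 1 kernel with no padding and stride 1 leaves the spatial size unchanged; only the channel count changes (3 in, 16 out).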

After that, we perform global average pooling: the output of convolutional layer 1 is compressed along the spatial dimension, turning each two-dimensional feature channel into a real number that, to some extent, has a global receptive field; the output dimension matches the number of input feature channels, C = 16. After global average pooling, the feature dimension C = 16 is reduced to C = 8 through fully connected layer 1 and activated by ReLU; the dimension is then raised back to the original 16 through fully connected layer 2. Using two fully connected layers adds nonlinearity, better fits the complex correlations between channels, and significantly reduces the number of parameters and computation. After fully connected layer 2, a sigmoid activation produces a normalized weight between 0 and 1, and finally a Scale operation weights the normalized channel attention onto the features of each channel; this helps focus on the overall feature importance of the samples and reduces the impact of noise on the final model. The Scale operation refers to channel-by-channel multiplication of the previous features by these weights, re-calibrating the original feature map in the channel dimension. We then continue feature extraction on the noise-reduced feature map with convolutional layers 2 and 3, adding the activation function ReLU and MaxPool between them.
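The squeeze-excitation-scale pipeline just described can be sketched in NumPy (a toy forward pass with random weights; in the real model, W1 and W2 are the learned fully connected layers 1 and 2):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, W1, W2):
    """Squeeze-and-Excitation forward pass on a (C, H, W) feature map.
    Squeeze: global average pooling; Excitation: FC(C->C/r) + ReLU then
    FC(C/r->C) + sigmoid; Scale: channel-wise re-weighting of the input."""
    s = x.mean(axis=(1, 2))            # squeeze: (C,) channel descriptor
    a = sigmoid(W2 @ relu(W1 @ s))     # excitation: per-channel weight in (0, 1)
    return x * a[:, None, None]        # scale: re-calibrate each channel

C, r = 16, 2                           # 16 channels, reduced to 8 as in the text
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 32, 32))   # feature map from convolutional layer 1
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
y = se_block(x, W1, W2)                # same shape, channels re-weighted
```

Because every channel weight lies in (0, 1), noisy channels can only be attenuated, never amplified, which is how the block suppresses large-sample noise.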
The activation function extracts helpful feature information and makes the model more discriminative, and the pooling layer avoids overfitting through down-sampling. Finally, because two or more fully connected layers can handle the nonlinearity satisfactorily, we feed the features of convolutional layer 3 into fully connected layers 3 and 4 to obtain a 512-dimensional vector; fully connected layer 5 then outputs a c-dimensional vector (c is the number of classes in the current dataset), which is passed to the softmax classifier for classification, outputting the predicted value y.
As shown in Figure 3, the second part of the dual attention mechanism module addresses the fact that, when learning proceeds class by class in incremental learning, features are unbalanced and important features are difficult to capture; a feature-space attention mechanism is therefore added to enhance the grasp of key information. Global aggregation based on the feature attention mechanism is shown in Figure 6. To enhance the capture of key information and solve the problem of unbalanced features whose importance is difficult to capture during class-incremental learning, this paper introduces a feature attention mechanism for the aggregation of the client models: we improve model performance by capturing the importance of the neural network layers of the multiple local models. This mechanism automatically weighs the relationship between the server model and each client. In iterative training, the parameters are continuously updated to reduce the weighted distance between the server and client models, so that the expected distance between them is minimized. The optimization objective is given by Formula (7).
Here, ω_t^s is the server's model parameters at the tth communication round, and ω_{t+1}^k is the model parameters of client k at the (t + 1)th round. D(·, ·) is the distance between two sets of neural parameters computed with the Euclidean distance formula, and att_k is the importance weight of the client model. Hierarchical soft attention captures the layer-wise importance of the neural networks in the multiple local models, and these are aggregated into the global model as feature attention so that the distance between the server and client models is minimized. Differentiating the expected function above yields the gradient, and the K clients perform gradient descent to update the parameters of the global model.
The importance weight att_k of the client model is calculated by hierarchical soft attention. The server-side model is used as the query and the client-side model as the key to compute the attention score of each layer of the neural network; the attention formula is shown in Equation (10). Note that, because of incremental learning, the number of neurons output by the model's last fully connected layer equals the number of dataset classes, which changes dynamically; when the local model parameters are loaded to the server and the attention score of each layer is computed together with the global model of the previous communication round, a weight mismatch arises. Therefore, before computing the attention scores, we average the weights of the last fully connected layer across all client models and assign them to the corresponding layer of the global model from the latest communication round. In Equation (10), K is the number of clients, ω^l is the model parameter of the lth layer of the server, ω_k^l is the model parameter of the lth layer of the kth local client, and l ∈ [1, L]. We take the p-norm of the difference between the matrices as the similarity between the query and key of the lth layer and apply the softmax function to the similarities to obtain the attention value of the lth layer of client k,

att_k^l = softmax_k(‖ω^l − ω_k^l‖_p),  (10)

and thus the feature attention of the entire client model.
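A sketch of this layer-wise attention aggregation (the update rule follows the attentive-aggregation idea described above; the step size `eps` and the toy one-layer parameters are our own assumptions):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_aggregate(server_layers, client_layers, p=2, eps=0.1):
    """Layer-wise feature attention: for each layer l, the similarity is the
    p-norm of (server - client) parameters; a softmax over the K clients
    gives attention weights att_k^l, used as coefficients in a gradient
    step that pulls the server model toward the client models."""
    new_layers = []
    for l, w_s in enumerate(server_layers):
        dists = np.array([np.linalg.norm(w_s - c[l], ord=p)
                          for c in client_layers])
        att = softmax(dists)                       # attention over the K clients
        step = sum(a * (w_s - c[l]) for a, c in zip(att, client_layers))
        new_layers.append(w_s - eps * step)        # gradient-descent update
    return new_layers

server = [np.array([0.0, 0.0])]                    # one layer, two parameters
clients = [[np.array([1.0, 0.0])], [np.array([0.0, 1.0])]]
updated = attention_aggregate(server, clients)     # → [array([0.05, 0.05])]
```

Unlike the plain n_k / n weighting of federated averaging, the coefficients here depend on parameter distances per layer, so a large-sample client no longer dominates the aggregate simply through data volume.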

Experimental Analysis
CPU: AMD R5-3600; memory: 16 GB DDR4; GPU: NVIDIA GeForce RTX 2070S; operating system: 64-bit Windows 10; the experimental framework is the open-source PyTorch framework. Stochastic gradient descent is used as the optimizer; the initial learning rate is 0.2, the weight decay coefficient is 0.00001, the training batch size is 128, and the number of local clients is 2. Each class of the CIFAR10 and CIFAR100 datasets is randomly divided into 2 shares and sent to the local clients in class increments. The local clients perform incremental learning on CIFAR10 in increments of 2 and 5 classes, and on CIFAR100 in increments of 10, 20, and 50 classes. The experiments in this paper mainly verify the influence of the two structures on accuracy.
This experiment uses a classic CNN as the network model in our architecture (CNN model code reference: https://github.com/jhjiezhang/FedAvg/blob/main/src/models.py, accessed on 1 March 2022). We compare it with the federated averaging model using a CNN of the same structure on the same test set. Each reported result is the average of 10 runs.
The datasets used in this experiment are the two public datasets CIFAR-10 and CIFAR-100. CIFAR-10 is a small dataset for recognizing ubiquitous objects, organized by Hinton's students Alex Krizhevsky and Ilya Sutskever. The dataset contains 10 classes of RGB color images: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The image size is 32 × 32, with 50,000 training images and 10,000 test images in total. The CIFAR-100 dataset has 100 classes, each with 600 color images of size 32 × 32, of which 500 are used for training and the remaining 100 for testing; the 100 classes are grouped into 20 superclasses (each containing 5 subclasses). To evaluate the model, let TP denote the positive samples that the model correctly predicts as positive, and FN the positive samples that the model incorrectly predicts as negative. The accuracy formula is as follows:
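Assuming the standard definition (the formula itself is not reproduced in the extracted text), accuracy can be computed from the confusion-matrix counts as:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = correctly predicted samples / all samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for illustration only
acc = accuracy(tp=45, tn=40, fp=10, fn=5)  # → 0.85
```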

Ablation Experiment-Rationality Analysis of Pre-Training Module
This is an ablation experiment: we use only the CNN model combined with the pre-training module, without the model framework of the second innovation of this algorithm, and apply the traditional FedAvg aggregation method, to verify the effect of combining the CNN model with pre-training on accuracy.
Figures 7 and 8 show the test results of the Incre-FL algorithm and of the Icarl-FedAvg algorithm with the pre-training module added on the two test sets, where the x-axis represents the number of classes and the y-axis the test accuracy. Across the different class counts during incremental learning, the average accuracy of our algorithm is slightly better than that of Icarl-FedAvg. On the CIFAR10 dataset, the accuracy reaches 43.45% and 63.16% for class increments of 2 and 5; on the CIFAR100 dataset, it reaches 30.03%, 39.35%, and 39% for class increments of 10, 20, and 50. The algorithm proposed in this paper makes a series of improvements, and its performance is ultimately improved to a certain extent. The influence of the pre-training module on the Incre-FL algorithm is shown in Table 1. We add the pre-training module to speed up the convergence of the model, and it plays a certain role in improving the accuracy of image classification.

Ablation Experiment-Rationality Analysis of Dual Attention Mechanism Module
There is a sample-imbalance problem across the clients in federated learning. A client with a large sample size carries a large weight parameter and therefore has a significant impact on the final model training result. At the same time, when federated learning adds incremental learning for dynamic task training and learns under class increments, the importance of unbalanced samples is challenging to capture. This ablation experiment uses only the dual attention mechanism (the second proposed improvement), without the pre-training module, and compares it with the Icarl-FedAvg learning algorithm to verify the effect of the dual attention mechanism on accuracy. The influence of the dual attention mechanism module is shown in Table 2. In the improved Incre-FL algorithm, a channel attention mechanism is added on the client side, which obtains the importance of the overall sample features of each client and reduces the influence of noise, and a feature space attention mechanism is added during federated aggregation, which enhances the global model's ability to capture the clients' critical information and ultimately improves the accuracy of the image classification task.
The dual attention mechanism in this experiment comprises the channel attention mechanism and the feature attention mechanism; next, we verify the accuracy improvement from adding each mechanism separately. Figures 11 and 12 show the test results of the Incre-FL algorithm, the Icarl-FedAvg algorithm, the SE-Icarl algorithm (with the channel attention mechanism added separately), and the Icarl algorithm on the CIFAR10 and CIFAR100 standard test sets. The figures show that the accuracy of our algorithm is significantly better than that of the Icarl-FedAvg algorithm, and that the SE-Icarl algorithm with the channel attention mechanism alone is better than the Icarl algorithm. Our algorithm's accuracy on the CIFAR10 dataset reaches 47.53% and 64.47% when the class increments are 2 and 5, and on the CIFAR100 dataset it reaches 32.09%, 40.20%, and 41.23% with class increments of 10, 20, and 50. The influence of the channel attention mechanism is shown in Table 3. The channel attention mechanism models the dependencies between channels in the network, helps the model focus on the importance of the overall sample features, and reduces the impact of noise on the final training results; we add it to the Icarl algorithm and the Icarl-FedAvg algorithm to improve the accuracy of image classification tasks.
Figures 13 and 14 show the test results of the Incre-FL algorithm and the Icarl-FedAvg algorithm with the feature attention mechanism added separately on the CIFAR10 and CIFAR100 standard test sets. The figures show that the accuracy of our algorithm is significantly better than that of the Icarl-FedAvg algorithm. Our algorithm's accuracy on the CIFAR10 dataset reaches 45.40% and 63.73% when the class increments are 2 and 5; on the CIFAR100 dataset, it reaches 31.02%, 39.86%, and 40.32% when the class increments are 10, 20, and 50. The influence of the feature attention mechanism is shown in Table 4: adding the feature space attention mechanism improves the accuracy of image classification tasks in incremental learning by enhancing the capture of crucial information.
Figures 15 and 16 show the test results of our algorithm, the pure Icarl strategy, and the federated averaging algorithm combined with the Icarl method on the CIFAR10 and CIFAR100 standard test sets. The figures show that the average accuracy of our algorithm is significantly better than that of the federated averaging algorithm with the Icarl strategy and lower than that of the incremental learning algorithm Icarl. The accuracy of Incre-FL is lower because Icarl trains a CNN directly on the centralized dataset, whereas in federated learning the participants never expose their data to the server or other parties. Hence, the federated model performs slightly worse than the centrally trained model, and the additional security and privacy protection are undoubtedly worth the loss of accuracy. Table 5 shows the accuracy of our algorithm on the two datasets: its test accuracy is higher than that obtained by adding either module alone, and lower than that of CNN training with the Icarl strategy after all client datasets are gathered in one place. A comparison is also made with the Icarl-FedAvg algorithm, with the number of clients set to 2 and the client data independent and identically distributed, on the two real datasets.
Overall, Incre-FL maintains good performance in almost all scenarios. Compared with the Icarl-FedAvg algorithm, the performance improvement of Incre-FL mainly comes from the pre-training module, which ensures the balance of the pre-training samples and reduces the impact of large-sample clients on the global model. We designed a dual attention mechanism module by adding the SE module to the classic convolutional neural network; during model training, this module helps the model obtain the importance of the overall sample features of each client and effectively reduces the impact of noise. The federated aggregation algorithm based on the feature attention mechanism is added to the global model to enhance its capture of critical feature information. Since the blockchain-oriented online federated incremental learning algorithm proposed by Luo Changyin et al. [16] is not a class-incremental method, it is not used as comparative data.
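The two halves of the dual attention mechanism can be illustrated with a minimal NumPy sketch: a squeeze-and-excitation (SE) channel gate on the client side, and a softmax-weighted parameter aggregation on the server side. The function names, the weight shapes (reduction ratio of 4), and the use of raw per-client scores are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def se_channel_attention(feat, w1, w2):
    """SE-style channel attention on a (C, H, W) feature map.
    w1: (C // r, C) and w2: (C, C // r) are the bottleneck weights."""
    # Squeeze: global average pooling over spatial dims -> per-channel descriptor
    z = feat.mean(axis=(1, 2))                       # (C,)
    # Excitation: bottleneck MLP with ReLU, then a sigmoid gate per channel
    s = np.maximum(w1 @ z, 0.0)                      # (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))           # (C,), each value in (0, 1)
    # Scale: reweight each channel by its learned importance
    return feat * gate[:, None, None]

def attention_aggregate(client_params, scores):
    """Server-side aggregation weighted by softmax of feature-attention scores."""
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    # Weighted sum of client parameters replaces plain FedAvg averaging
    return sum(w * p for w, p in zip(weights, client_params))
```

With equal scores the aggregation reduces to the ordinary federated average, so the attention weights only change the result when the server judges some clients' features more informative than others.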

Conclusions
In business scenarios, people need not only to ensure data security but also to cope with dynamic changes in training tasks. We need an algorithm that avoids the cost of retraining on all old and new data, as the traditional training paradigm requires whenever new tasks and data arrive. Whether user data can be dynamically updated has therefore become key to judging the merits of an algorithm. To study a federated learning algorithm that can cope with dynamic changes of training tasks while keeping data confidential, this paper proposes a federated incremental learning framework. A pre-training module is added to the framework to improve the convergence speed of the model, and a fusion strategy based on a dual attention mechanism is proposed. This strategy reduces the impact of large-sample noise on the final model training results, strengthens the capture of feature importance, and alleviates the feature imbalance problem that arises when incremental learning is added for dynamic, class-incremental task training. Experiments on standard datasets show that the algorithm improves accuracy on common datasets.

Figure 1. The federated learning framework proposed in this paper.

Figure 2. Federated incremental learning framework. The clients first train and aggregate on a fixed number of class-1 samples to obtain pre-trained global model parameters, which speeds up model convergence and reduces the impact of non-important features in large-sample client data on the global model. The global model parameters are distributed to the clients, which then train on the remaining class-1 samples. After training, the client model parameters of the t-th communication round are sent to the server, and the server returns the fused model parameters to the clients. Each client then forms a training set from the class-2 samples and some old data, uses the feature extractor to extract feature vectors from the old and new data, and computes their respective average feature vectors. The predicted value is optimized through a loss function combining distillation loss and classification loss. The client model parameters of the (t + 1)-th communication round are obtained and sent to the server, and so on until the incremental learning of all classes is completed.
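The combined distillation-plus-classification objective used in this training loop can be sketched as follows. This is a minimal NumPy sketch of the standard formulation: the temperature, the weighting factor `lam`, and the function names are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combined_loss(new_logits, labels, old_logits, temperature=2.0, lam=1.0):
    """Classification loss on the current labels plus a distillation loss
    that keeps the new model close to the previous model's outputs."""
    n = new_logits.shape[0]
    # Classification: cross-entropy against the ground-truth labels
    probs = softmax(new_logits)
    ce = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    # Distillation: cross-entropy against the old model's temperature-softened
    # outputs, penalizing drift away from previously learned classes
    p_old = softmax(old_logits / temperature)
    p_new = softmax(new_logits / temperature)
    kd = -np.mean(np.sum(p_old * np.log(p_new + 1e-12), axis=1))
    return ce + lam * kd
```

Setting `lam` to zero recovers plain cross-entropy training; a positive `lam` trades some plasticity on new classes for stability on old ones.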

Figure 7. The accuracy curve of the pre-training module on the CIFAR10 test set. (a) Accuracy when the class increment is 2. (b) Accuracy when the class increment is 5.

Figure 8. The accuracy curve of the pre-training module on the CIFAR100 test set. (a) Accuracy when the class increment is 10. (b) Accuracy when the class increment is 20. (c) Accuracy when the class increment is 50.
Figures 9 and 10 show the accuracy curves of the dual attention mechanism module on the CIFAR10 and CIFAR100 test sets.

Figure 9. The accuracy curve of the dual attention mechanism module on the CIFAR10 test set. (a) Accuracy when the class increment is 2. (b) Accuracy when the class increment is 5.

Figure 10. The accuracy curve of the dual attention mechanism module on the CIFAR100 test set. (a) Accuracy when the class increment is 10. (b) Accuracy when the class increment is 20. (c) Accuracy when the class increment is 50.

Figure 11. The accuracy curve of the channel attention mechanism module on the CIFAR10 test set. (a) Accuracy when the class increment is 2. (b) Accuracy when the class increment is 5.

Figure 12. The accuracy curve of the channel attention mechanism module on the CIFAR100 test set. (a) Accuracy when the class increment is 10. (b) Accuracy when the class increment is 20. (c) Accuracy when the class increment is 50.

Figure 13. The accuracy curve of the feature attention mechanism module on the CIFAR10 test set. (a) Accuracy when the class increment is 2. (b) Accuracy when the class increment is 5.

Figure 14. The accuracy curve of the feature attention mechanism module on the CIFAR100 test set. (a) Accuracy when the class increment is 10. (b) Accuracy when the class increment is 20. (c) Accuracy when the class increment is 50.

Figure 15. CIFAR10 accuracy curve. (a) Accuracy when the class increment is 2. (b) Accuracy when the class increment is 5.

Figure 16. CIFAR100 accuracy curve. (a) Accuracy when the class increment is 10. (b) Accuracy when the class increment is 20. (c) Accuracy when the class increment is 50.

Table 1. The influence of the pre-training module.

Table 2. The influence of the dual attention mechanism module.

Table 3. The influence of the channel attention mechanism module.

Table 4. The influence of the feature attention mechanism module.

Table 5. The accuracy of the proposed algorithm on the two datasets.