Article

Robust Hierarchical Federated Learning with Anomaly Detection in Cloud-Edge-End Cooperation Networks

Yujie Zhou, Ruyan Wang, Xingyue Mo, Zhidu Li and Tong Tang
1 School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2 Advanced Network and Intelligent Interconnection Technology Key Laboratory of Chongqing Education Commission of China, Chongqing 400065, China
3 Key Laboratory of Ubiquitous Sensing and Networking, Chongqing 400065, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(1), 112; https://doi.org/10.3390/electronics12010112
Submission received: 9 December 2022 / Revised: 21 December 2022 / Accepted: 21 December 2022 / Published: 27 December 2022
(This article belongs to the Special Issue Resource Allocation in Cloud–Edge–End Cooperation Networks)

Abstract: Federated learning (FL) enables devices to collaboratively train machine learning (ML) models on distributed data while preserving privacy. However, traditional FL is inefficient and costly in cloud–edge–end cooperation networks, since the classical client-server communication framework it adopts ignores the real network structure. Moreover, malicious attackers and malfunctioning clients may hide among the participants and exert adverse effects, in the form of abnormal behaviours, on the FL process. To address these challenges, we leverage cloud–edge–end cooperation and propose a robust hierarchical federated learning (R-HFL) framework that enhances the system's inherent resistance to abnormal behaviours while improving communication efficiency in practical networks and retaining the advantages of traditional FL. Specifically, we introduce a hierarchical cloud–edge–end collaboration-based FL framework to reduce communication costs. Within this framework, we design a detection mechanism, partial cosine similarity (PCS), to filter adverse clients and improve performance; the proposed lightweight technique is highly parallelizable. In addition, we theoretically discuss the influence of the proposed PCS on the convergence and stabilization of FL. Finally, experimental results show that the proposed R-HFL consistently outperforms baselines under malicious attacks in general cases, which further demonstrates the effectiveness of our scheme.

1. Introduction

With the rapid development of communication technologies, a large amount of heterogeneous data is generated at network edges [1,2]. To enable intelligent network services such as vehicle scheduling [3,4], image processing [5], and edge computing and caching [6,7], machine learning (ML), a data-driven technology, can be applied. Given that the centralized cloud-computing learning paradigm incurs large transmission delays and risks user privacy, an efficient and privacy-preserving paradigm is required to learn policies online for real-time intelligent inference in heterogeneous cloud–edge–end cooperation networks.
As one of the privacy-preserving distributed learning paradigms, federated learning (FL) [8,9] can collaboratively train a common ML model from heterogeneous data generated at network edges. Traditional FL (or FedAvg [8]) generally adopts the client-server framework: the ML model is downloaded to each client, locally updated in parallel, and uploaded to the server for parameter averaging. This framework includes two popular architectures in practical deployments: cloud-based FL [10,11,12] and edge-based FL [13,14,15]. Cloud-based FL sets the cloud as the server, which permits a large number of clients to participate in FL training for advantageous global performance, but also incurs high communication overhead and long delays due to network congestion and large model sizes. Edge-based FL sets the edge as the server and involves only the subset of clients near that edge server; training is efficient, but the result is a biased approximation of the global optimum with poor performance. Thus, using the client-server framework without considering cloud–edge–end cooperation makes traditional FL unsuitable for general network applications that require high performance with low latency [3,4,5,6,7].
To combine the advantages of the two FL types and leverage the real-world network communication system for intelligent network applications, hierarchical federated learning (HFL) [16] has been proposed as a bi-level cloud–edge–client framework that cooperatively trains ML models on the cloud and edge sides. Specifically, on the edge side, multiple edge-based FL updates are performed in parallel: each edge server communicates with a disjoint set of clients to derive its own edge model. Then, on the cloud side, a centralized cloud server communicates with the edge servers and averages all uploaded edge models (the edge servers can be regarded as clients of a cloud-based FL). This paradigm technically resolves the trade-off between communication and computation while inheriting the privacy-preservation property of traditional FL. However, this inherited property also means that HFL servers have no access to raw data and no full control over participants' behaviours. Consequently, malicious or malfunctioning heterogeneous clients may upload incorrect or noisy model values to the server (known as Byzantine or poisoning attacks), which strongly hurts FL training and degrades the final performance [17].
To avoid abnormal behaviours degrading training performance, we design a new HFL method with a lightweight detection mechanism that can distributively detect adverse clients’ uploaded information at the edge side for system robustness. In particular, we divide and offload the detection task to the edge, i.e., deploying a partial detector on each edge server for identification. In addition, these partial detectors are simply established by applying the cosine similarity technique, which detects abnormal behaviour by measuring the distance between the previous edge aggregation model and the uploaded client model.
The main contributions of this paper are summarized as follows:
  • A novel method called R-HFL is proposed to ensure the training performance under Byzantine attacks by the distributed anomaly detection mechanism customized for the HFL framework.
  • The effectiveness and convergence of our proposed method are mathematically discussed in the general non-convex case.
  • Numerical results are provided to experimentally show that our proposed algorithm can effectively minimize the negative impact of abnormal behaviours, which further illustrates the feasibility of distributive detection in the HFL system.
The remainder of the paper is organized as follows. In Section 2, related works are introduced. In Section 3, the learning system architectures are described. In Section 4, the new R-HFL algorithm is proposed and discussed. In Section 5, the effectiveness and convergence of R-HFL are analysed. In Section 6, experiment results are presented. Finally, the paper is concluded in Section 7.

2. Related Work

Federated learning comprises horizontal federated learning, vertical federated learning and federated transfer learning; we focus on the first type in this paper and call it federated learning (FL) for short. FL was first proposed in [8] as FedAvg, which provided a guideline for local parameter aggregation. Many works have followed FedAvg to analyse its theoretical performance. Refs. [18,19,20,21] analysed FL convergence in iid settings (i.e., raw data is independent and identically distributed), where Yu et al. [18] established the result for non-convex optimization objectives. Additionally, since heterogeneity is ubiquitous in realistic environments, some studies analysed FL convergence on non-iid (non-independent-and-identically-distributed) data under inconsistent computation capabilities. The works [22,23,24] analysed FL convergence in terms of error bounds under constraints on the difference between local and global gradients. In particular, the results in [25] pointed out that FedAvg is not appropriate for training on non-iid data. In [26], the authors proved that when local training steps are heterogeneous among clients, many federated learning schemes including FedAvg converge to non-optimal stationary model parameters. Other works are devoted to deploying FedAvg in practical networks. Since uploading poorly trained local models (due to statistical and system heterogeneity) occupies network bandwidth, increases communication overhead and degrades learning system efficiency, client scheduling is a critical issue [10,27]. Those works discussed scheduling policies that sample a subset of clients for aggregation to save communication cost, while [28] studied the problem in wireless communication scenarios, where communication link quality is optimized in the parameter aggregation stage of FL. Besides client scheduling, hierarchical FL [16] is another promising approach for effectively reducing communication overhead, improving training efficiency by considering the practical network architecture. Works such as [29,30] aimed to optimize HFL performance by solving resource allocation problems such as client-edge assignment.
The above works focused on improving FL performance in the data and system aspects but lacked discussion of learning system security, even though the distributed learning framework across multiple clients leaves FL vulnerable. Recently, several defensive techniques [31,32,33,34,35] have been proposed against poisoning attacks. For instance, [31] discussed methods for identifying abnormal client updates in FL. The work [32] proposed the Krum rule to check for deviations in the parameters uploaded by each participant. The work [33] designed the Auror scheme, which detects malicious users and generates correspondingly accurate models by inserting a clustering operation on each local update before every aggregation round as a defence. The works [34,35] proposed auto-encoder-based approaches to help identify malicious clients. However, these studies built defensive mechanisms for classical distributed ML or traditional FL, i.e., they deploy a single detector at the server in the classical two-tier client-server architecture, which is not directly applicable to the HFL system because the layered structure prevents such algorithms from accessing all the essential information uploaded by clients. Additionally, some of them, such as [34,35], incur heavy extra communication and computation costs to train a neural network that assists anomaly detection. Thus, a new detection mechanism tailored to the HFL system is required, which motivates this paper.

3. Learning System Architecture

In this section, we first review the traditional FL with the client–server architecture based on the typical algorithm FedAvg. Then we introduce the hierarchical FL framework with the advantageous cloud–edge–client architecture.

3.1. Traditional Federated Learning

For all participating clients in $\mathcal{N}$, the FL optimization objective is usually formulated as follows:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} f_i(x), \qquad (1)$$
where $d$ denotes the dimension of the model parameter vector $x$, and $f_i(x)$ represents the sub-objective defined by the local ML model on each client $i$, with $f_i(x) := \mathbb{E}_{\vartheta_i \sim \mathcal{D}_i}[f(x;\vartheta_i)]$. Here, $\mathcal{D}_i$ is the local dataset of client $i$ and $\vartheta_i$ is a data sample drawn from it. We describe a round of the FedAvg algorithm in three stages: (i) (server) broadcasting, (ii) (client) updating, and (iii) (server) aggregation. Specifically, in the $k$-th communication round, we have the following.
(i) Broadcasting: The server initializes (if necessary) and then broadcasts a common global model x k to each participating client i N .
(ii) Updating: Each client $i$ initializes the local model $x_{i,0}^k$ as $x^k$ and sequentially updates it by performing stochastic gradient descent (SGD) with step-size $\eta$ on $f_i$:
$$x_{i,j+1}^k \leftarrow x_{i,j}^k - \eta\, g_i(x_{i,j}^k), \quad j = 0, 1, \ldots, \tau-1,$$
where $x_{i,j}^k$ denotes the local model after the $j$-th SGD step and $g_i(\cdot)$ represents the stochastic gradient of $f_i(\cdot)$ (i.e., shorthand for $\nabla f(\cdot\,;\vartheta_i)$ with $\vartheta_i \sim \mathcal{D}_i$). Once the local update step reaches $j = \tau$, client $i$ uploads its trained model to the server.
(iii) Aggregation: The server receives and aggregates all trained local models to generate a new global model for the next training round, i.e.,
$$x^{k+1} \leftarrow \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} x_{i,\tau}^k.$$
The whole process terminates when $k$ reaches $K$.
We summarize the process as Algorithm 1, which searches for the optimal solution of objective (1) and is equivalent to the procedure of cloud-based FL when the cloud is regarded as the server. In the case of edge-based FL, only a fraction of clients near the edge server (i.e., $\mathcal{N}_m \subset \mathcal{N}$, $m \in \mathcal{M}$) can participate in training, since the server has a limited communication range. In other words, in Line 3 of Algorithm 1, the subset $\mathcal{N}_m$ of any edge server $m$ is fed into the ServerUpdate operation in place of the whole client set $\mathcal{N}$, where ServerUpdate$(\cdot\,;\cdot)$ encapsulates all the operations presented above (i.e., broadcasting, updating and aggregation).
Algorithm 1 Traditional Client-Server FL.
Input: local update step τ, total communication rounds K
Output: optimized global model x*
1: server randomly initializes global model x^k with counter k = 0
2: for each round k = 0, 1, ..., K−1 do
3:    x^{k+1} ← ServerUpdate(x^k; N)
4: end for
5: server returns x* ← x^K to all clients
————————————————
6: ServerUpdate(x; I):
7: send x to clients i ∈ I
8: for each client i ∈ I in parallel do
9:    x_i ← ClientUpdate(x; i)
10:   send x_i to server
11: end for
12: return (1/|I|) Σ_{i∈I} x_i
————————————————
13: ClientUpdate(x; i):
14: initialize x_{i,j} ← x with counter j = 0
15: for j = 0, 1, ..., τ−1 do
16:    x_{i,j+1} ← x_{i,j} − η g_i(x_{i,j})
17: end for
18: return x_{i,τ}
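To make Algorithm 1 concrete, the following is a minimal, self-contained Python sketch of the FedAvg loop on synthetic data. It is an illustration rather than the authors' implementation: the quadratic local losses, the client count, and the hyper-parameter values are assumptions chosen only to keep the example short. Each round calls server_update once, mirroring the broadcasting, updating and aggregation stages above.

import numpy as np

def client_update(x, data, eta=0.01, tau=10):
    # ClientUpdate: tau local gradient steps on a least-squares loss
    # f_i(x) = ||A x - b||^2 / (2 n_i), used here as a stand-in for SGD on a real model.
    A, b = data
    x = x.copy()
    for _ in range(tau):
        grad = A.T @ (A @ x - b) / len(b)
        x -= eta * grad
    return x

def server_update(x, clients):
    # ServerUpdate: broadcast x, collect locally updated models, average them.
    local_models = [client_update(x, data) for data in clients]
    return np.mean(local_models, axis=0)

# Synthetic setup (assumed values): 25 clients, model dimension d = 5.
rng = np.random.default_rng(0)
d, n_clients = 5, 25
x_true = rng.normal(size=d)
clients = []
for _ in range(n_clients):
    A = rng.normal(size=(50, d))
    b = A @ x_true + 0.1 * rng.normal(size=50)
    clients.append((A, b))

x = np.zeros(d)                      # server initialization
for k in range(20):                  # K communication rounds
    x = server_update(x, clients)
print("distance to x_true:", np.linalg.norm(x - x_true))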

3.2. Hierarchical Federated Learning

As discussed in the previous sections, cloud-based FL involves massive numbers of clients but incurs high communication overhead, while edge-based FL incorporates only a limited number of clients at lower cost, incurring a training performance loss; we therefore introduce hierarchical FL (HFL) to combine their advantages. Intuitively, we perform edge-based FL in parallel on several edge servers $\mathcal{M}$ and then aggregate their edge-level global models (or edge models for short) at a cloud server after every $E$ rounds of client-edge communication. Each edge server $m \in \mathcal{M}$ communicates with the clients $\mathcal{N}_m$ in its range, which yields a family of disjoint sets $(\mathcal{N}_m)_{m \in \mathcal{M}}$, i.e., $\mathcal{N}_m \cap \mathcal{N}_{m'} = \emptyset$, $\forall m \neq m' \in \mathcal{M}$. We summarize the process as Algorithm 2. In this framework, the global model is updated as in Line 11 of Algorithm 2,
$$x^{k+1} \leftarrow \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} x_m^{k,E},$$
where $x_m^{k,E}$ represents the updated edge model, obtained by applying the ServerUpdate operation to $x^k$ with the client subset $\mathcal{N}_m$ $E$ times in the $k$-th cloud communication round, as in Lines 4–10 of Algorithm 2. Therefore, the cloud server communicates with $|\mathcal{M}|$ edge servers rather than $|\mathcal{N}|$ clients, so the overhead is significantly reduced.
Algorithm 2 Hierarchical cloud–edge–client FL.
Input: local update step τ, total communication rounds K, edge server set M, client fractions (N_m)_{m∈M}
Output: optimized global model x*
1: cloud randomly initializes global model x^k with counter k = 0
2: for each round k = 0, 1, ..., K−1 do
3:    send x^k to each edge m ∈ M
4:    for each edge m ∈ M in parallel do
5:       initialize x_m^{k,e} ← x^k with counter e = 0
6:       for each round e = 0, 1, ..., E−1 do
7:          x_m^{k,e+1} ← ServerUpdate(x_m^{k,e}; N_m)
8:       end for
9:       send x_m^{k,E} to cloud
10:    end for
11:    aggregate x^{k+1} ← (1/|M|) Σ_{m∈M} x_m^{k,E}
12: end for
13: cloud returns x* ← x^K to all clients
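Building on the previous sketch (and reusing its server_update, clients, and d), the snippet below illustrates the two-level aggregation of Algorithm 2: each edge server runs E edge rounds of ServerUpdate over its own disjoint client group, after which the cloud averages the edge models. The 5-clients-per-edge partition and the values of E and K are assumptions matching the description above, not extracted code.

def hierarchical_round(x_cloud, edge_client_sets, E=5):
    # One cloud round of Algorithm 2: each edge server runs E edge-level
    # FedAvg rounds over its own clients, then the cloud averages edge models.
    edge_models = []
    for clients_m in edge_client_sets:          # executed in parallel in practice
        x_edge = x_cloud.copy()
        for _ in range(E):                      # client-edge communication rounds
            x_edge = server_update(x_edge, clients_m)
        edge_models.append(x_edge)
    return np.mean(edge_models, axis=0)         # cloud aggregation

# Partition the 25 synthetic clients into 5 disjoint edge groups (assumed split).
edge_client_sets = [clients[m * 5:(m + 1) * 5] for m in range(5)]
x = np.zeros(d)
for k in range(10):                             # K cloud communication rounds
    x = hierarchical_round(x, edge_client_sets)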

4. A Lightweight Detection Mechanism for Hierarchical Federated Learning

The HFL system promises to balance training efficiency and performance well: in each training round, on client $i$, the model $x$ is trained with the private dataset to minimize the local loss $f_i(x)$. However, due to statistical heterogeneity between end devices and malicious tampering with raw data, the convergence direction of an updated local model may differ significantly from, or even oppose, that of the global model. Such clients participating in model aggregation have a negative impact on FL performance and convergence. To ensure the effectiveness of the FL framework, it is necessary to detect and filter adverse clients. In this section, we propose a lightweight mechanism that detects malicious attackers at the parameter level, as shown in Figure 1, and introduce it in detail subsequently.
Executing additional detection tasks may increase computation time, especially when the detection task is processed on a single server as in the classical FL system. Note that in an HFL system, the task can be divided and offloaded to edge servers for execution; that is, each edge server processes its own partial detection flow simultaneously, an inherently more parallelizable paradigm that reduces physical training time. Formally, denoting the detector by $\mathcal{D}$, the whole detection task can be described as $\mathcal{D}(\mathcal{N})$, which filters adverse clients in $\mathcal{N}$ and returns a trustworthy subset $\mathcal{N}'$. We aim to establish a group of detectors $(\mathcal{D}_m)_{m \in \mathcal{M}}$ at the edge that satisfy
$$\mathcal{N}' = \bigcup_{m \in \mathcal{M}} \mathcal{N}'_m = \bigcup_{m \in \mathcal{M}} \mathcal{D}_m(\mathcal{N}_m) := \mathcal{D}\Big(\bigcup_{m \in \mathcal{M}} \mathcal{N}_m\Big) = \mathcal{D}(\mathcal{N})$$
to replace a single detector at the cloud server. Thus, we can perform D m in parallel at each edge server m to accelerate the detection process.
To avoid introducing heavy computation costs in the detection phase, we equip each edge server $m$ with a simple detector $\mathcal{D}_m$, which directly calculates the cosine similarity, i.e.,
$$\text{similarity}(x, x') = \frac{\langle x, x' \rangle}{\|x\|\,\|x'\|},$$
between the previous edge model $x_m^{k,e}$ and the currently uploaded client models $(x_{m,i,\tau}^{k,e})_{i \in \mathcal{N}_m}$ (or $(x_{m,i}^{k,e})_{i \in \mathcal{N}_m}$ in the absence of ambiguity) to evaluate whether a participant is working properly, according to a defined threshold $t$. Specifically, if $\text{similarity}(x_{m,i}^{k,e}, x_m^{k,e}) > t$, the client is allowed to participate normally in this training round; otherwise, we drop the client to protect model aggregation. Hence, the edge detector $\mathcal{D}_m$ checks all local models uploaded from clients $\mathcal{N}_m$ and then returns a trusted subset of clients to the server. Thereafter, the edge server aggregates local models from the filtered clients $\mathcal{D}_m(\mathcal{N}_m)$. The procedure is summarized as Algorithm 3. The overall architecture is the same as Algorithm 2, but with a surrogate ServerUpdate operation as shown in Lines 2–9. In particular, the additional detection process is represented in Line 8 and Lines 10–15, performed on each edge server $m$ as the detector $\mathcal{D}_m$ for checking clients $\mathcal{N}_m$.
Algorithm 3 R-HFL: Robust Hierarchical FL.
1: run Algorithm 2 with the surrogate ServerUpdate function:
2: SurrogateServerUpdate(x; I):
3: send x to clients i ∈ I
4: for each client i ∈ I in parallel do
5:    x_i ← ClientUpdate(x; i)
6:    send x_i to server
7: end for
8: detect I ← D(I; t)
9: return (1/|I|) Σ_{i∈I} x_i
————————————————
10: D(I; t) ≡ PCS(I; t):
11: I′ ← ∅
12: for each i ∈ I: if ⟨x_i, x⟩ / (‖x_i‖ ‖x‖) > t then
13:    I′ ← I′ ∪ {i}
14: end if
15: return I′
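The PCS detector of Algorithm 3 reduces to one cosine-similarity test per uploaded model. The sketch below, reusing client_update and NumPy from the earlier sketches, shows how an edge server could filter client uploads before aggregation; the threshold value and the fallback used when every upload is rejected are assumptions, and this is not the authors' released code.

def cosine_similarity(a, b, eps=1e-12):
    # Cosine similarity between two flattened parameter vectors.
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def pcs_filter(edge_model, client_models, t=0.5):
    # PCS detector D_m: keep only uploads whose similarity with the previous
    # edge model exceeds the threshold t (Lines 10-15 of Algorithm 3).
    return [x_i for x_i in client_models if cosine_similarity(x_i, edge_model) > t]

def surrogate_server_update(x_edge, clients_m, t=0.5):
    # SurrogateServerUpdate: local training, PCS detection, then aggregation
    # over the trusted subset only.
    uploads = [client_update(x_edge, data) for data in clients_m]
    trusted = pcs_filter(x_edge, uploads, t)
    if not trusted:                 # assumed fallback: keep the previous edge model
        return x_edge
    return np.mean(trusted, axis=0)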

5. Further Analysis on Partial Cosine Similarity

The proposed detector is named partial cosine similarity (PCS). In this section, we mathematically show that PCS controls the update deviation and thereby effectively stabilizes the HFL process against poisoning attacks. We then show that the HFL model trained with PCS converges to a stationary point of a weighted version of the original objective (1) in the general non-convex case. All relevant proofs are provided in Appendix A, Appendix B and Appendix C.
Before presenting the main results, we first review some important conditions used in the proof of HFL convergence [16].
Assumption A1 (smoothness). The gradient of each local loss $f_i$ is $L$-Lipschitz continuous, i.e.,
$$\|\nabla f_i(x) - \nabla f_i(x')\| \le L \|x - x'\|, \quad \forall x, x' \in \operatorname{dom} f_i;$$
hence, the gradient of each edge loss $f_m := \frac{1}{|\mathcal{D}_m(\mathcal{N}_m)|} \sum_{i \in \mathcal{D}_m(\mathcal{N}_m)} f_i$ on edge server $m$ is consequently also $L$-Lipschitz continuous.
Assumption A2 (gradient divergence). For any $x$, the gradient divergence $\delta_i^m$ between each local loss $f_i$ and the corresponding edge loss $f_m$ with $i \in \mathcal{N}_m$ satisfies
$$\|\nabla (f_i - f_m)(x)\| \le \delta_i^m,$$
while the divergence $\Delta_m$ between each edge loss $f_m$ and the global loss $f$ satisfies
$$\|\nabla (f_m - f)(x)\| \le \Delta_m.$$
Theorem 1 (controlled deviation). The gradient deviation of $f_m$ within an edge communication round is controlled by the similarity threshold $t \in [-1, 1]$, that is,
$$\|\nabla f_m(x_m^{k,e+1}) - \nabla f_m(x_m^{k,e})\|^2 \le L^2 \big(\|x_m^{k,e+1}\| - \|x_m^{k,e}\|\big)^2 - 2(t-1) L^2 \|x_m^{k,e+1}\| \, \|x_m^{k,e}\|.$$
Remark 1.
The theorem shows that the deviation is controlled within a limited range determined by the threshold $t$; in particular, as $t$ increases, the deviation control becomes stricter. This also explains why excluding local models that may hurt the edge model $x_m^{k,e}$ (i.e., those with $\text{similarity}(x_{m,i}^{k,e}, x_m^{k,e}) \le t$) from the aggregation stabilizes HFL training on the edge side and then diffuses this effect to the cloud.
Theorem 2 (convergence to a critical point of the weighted objective). Suppose that (i) each client $i$ has a probability $p_i \in [0,1]$ of being dropped by the PCS mechanism in each edge communication round, and (ii) the local learning rate $\eta := \eta(k)$ varies with the cloud communication round $k$ and satisfies $\sum_{k=0}^{\infty} \eta(k) = \infty$ and $\sum_{k=0}^{\infty} \eta^2(k) < \infty$. Then, for any positive number $\epsilon$, the weighted average of squared gradient norms of $\tilde{f}$ satisfies
$$\lim_{K \to \infty} \frac{1}{\sum_{k=0}^{K} \eta(k)} \sum_{k=0}^{K} \eta(k) \|\nabla \tilde{f}(x^k)\|^2 \le \epsilon,$$
where $\tilde{f} := \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} (1 - p_i) f_i$ is a weighted version of the original objective (1).
Corollary 1 (convergence bound in terms of the original loss). Under the conditions of Theorem 2, let $(x^k)_{k \le K}$ be the sequence generated by the R-HFL method; then the minimal gradient norm of the original global objective $f$ is bounded as follows:
$$\lim_{K \to \infty} \min_{k \le K} \|\nabla f(x^k)\|^2 \le 2 C_{p,\delta,\Delta} + \frac{2 |\mathcal{N}|^2}{\big(\sum_{i \in \mathcal{N}} (1 - p_i)\big)^2}\, \epsilon,$$
where $C_{p,\delta,\Delta} := \sum_{i \in \mathcal{N}} \Big(\frac{1}{|\mathcal{N}|} - \frac{1 - p_i}{\sum_{i \in \mathcal{N}} (1 - p_i)}\Big)^2 \sum_{i \in \mathcal{N}} (\delta_i^m + \Delta_m)^2\big|_{i \in \mathcal{N}_m}$ measures the deviation degree, which is determined by the learning system properties $p, \delta, \Delta$.
Remark 2.
Theorem 2 illustrates that running the proposed R-HFL is equivalent to performing SGD on the weighted objective $\tilde{f}$ rather than the original $f$, which shows that applying PCS causes an optimization objective drift that makes the ideal training performance harder to reach. Corollary 1 further shows that the drift is mainly controlled by the drop probabilities $(p_i)_{i \in \mathcal{N}}$: if $p_i = 0$ for all $i \in \mathcal{N}$, then $C_{p,\delta,\Delta} = 0$. Combining Theorems 1 and 2 and Corollary 1, note that a small similarity threshold $t$ weakens detection while a large one increases the drop probabilities; thus, properly setting $t$ is crucial for achieving optimal performance.
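As an illustrative example (not taken from the paper), a harmonically decaying step size satisfies condition (ii) of Theorem 2:
$$\eta(k) = \frac{\eta_0}{k+1}, \qquad \sum_{k=0}^{\infty} \eta(k) = \eta_0 \sum_{k=0}^{\infty} \frac{1}{k+1} = \infty, \qquad \sum_{k=0}^{\infty} \eta^2(k) = \eta_0^2 \sum_{k=0}^{\infty} \frac{1}{(k+1)^2} = \frac{\pi^2}{6}\,\eta_0^2 < \infty.$$
More generally, any polynomially decaying rate $\eta(k) \propto (k+1)^{-\alpha}$ with $\alpha \in (1/2, 1]$ meets both requirements.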

6. Experiment Results

In this section, experiment results are presented and discussed. Firstly, the experimental setup is established. Then, the effectiveness of the proposed R-HFL is discussed under different types of attacks. Finally, the effects of the crucial factor, threshold t, on final training performance are analysed in detail.

6.1. Experiment Setup

We deploy 5 edge servers ($|\mathcal{M}| = 5$), each communicating with 5 independent clients ($|\mathcal{N}_m| = 5$, $m \in \mathcal{M}$); thus, the total number of participants is 25 ($|\mathcal{N}| = 25$). Each client, holding heterogeneous raw data under the non-iid split of [16], is randomly assigned to an edge server. Specifically, each client's dataset contains only two randomly proportioned label classes with different numbers of samples.

6.1.1. Model and Dataset

The classical multilayer perceptron (MLP) is used as the basic ML model, where the corresponding loss function is set as cross-entropy. Testing accuracy is used as the primary performance metric. The popular public dataset MNIST is used to train ML models for the image recognition task with the HFL framework, which contains 70,000 gray-scale images of handwritten digits, 60,000 for training and the remaining 10,000 for testing.

6.1.2. Adversarial Attacks

We investigate the performance of the proposed R-HFL under different poisoning attacks [17,31] as follows.
(i) Additive Noise of Model Parameter (ANMP) adds Gaussian noise to local model parameters, where the noise obeys $\mathcal{N}(0, 1)$. That is, for client $i$, each component $e_j^i$ of the local model parameters $x_i$ satisfies $e_j^i \leftarrow e_j^i + \xi$ with $\xi \sim \mathcal{N}(0, 1)$, where $x_i = (e_1^i, \ldots, e_j^i, \ldots, e_d^i)$.
(ii) Sign-Flipping of Model Parameter (SFMP) flips the sign of the local model parameters, i.e., for client $i$, $x_i \leftarrow -x_i$.
(iii) Additive Noise of Data Distribution (ANDD) adds Gaussian noise to local datasets, where the noise obeys $\mathcal{N}(0, 10)$, analogously to ANMP.
In this paper, we consider that there are no completely trusted participants in the FL training process, i.e., any anonymous participating client may act maliciously or suffer from malfunctions or interference; hence, in any edge communication round, every client sends a poisoned local model to the server with probability $prob$ ($0 \le prob \le 1$). Thus, $prob$ quantitatively characterizes the intensity of an attack.
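For concreteness, the sketch below implements the three attack models on flattened NumPy parameter vectors (ANMP, SFMP) or on raw features (ANDD), applied with probability prob per edge round. It is one interpretation of the descriptions above, not the authors' code; in particular, treating the second argument of N(0, 10) as a standard deviation is an assumption.

def anmp(x_i, rng):
    # Additive Noise of Model Parameter: perturb every component with N(0, 1).
    return x_i + rng.normal(0.0, 1.0, size=x_i.shape)

def sfmp(x_i, rng=None):
    # Sign-Flipping of Model Parameter: upload -x_i instead of x_i.
    return -x_i

def andd(features, rng):
    # Additive Noise of Data Distribution: perturb raw samples with N(0, 10);
    # the 10 is treated as the standard deviation here (assumption).
    return features + rng.normal(0.0, 10.0, size=features.shape)

def maybe_poison(x_i, attack, prob, rng):
    # With probability prob, replace the honest upload by its poisoned version.
    return attack(x_i, rng) if rng.random() < prob else x_i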

6.1.3. Hyper-Parameters

The learning rate $\eta$ is initialized to 0.01 with a decay coefficient of 0.995 per cloud round. The local training batch size is set to 20. The number of local training epochs $\tau$ is set to 10. The number of edge aggregations $E$ per cloud round is set to 5, while the total number of cloud aggregations $K$ is set to 10; i.e., the total number of local training epochs on each client is $K E \tau = 500$, and the total number of edge communication rounds on each edge server is $K E = 50$. Additionally, the attack intensity $prob$ and the PCS detection threshold $t$ both default to 0.5.

6.2. Numerical Results

6.2.1. Performance Evaluation under Different Attacks

Figure 2 shows the global testing accuracy versus cloud communication rounds, together with the corresponding averaged edge testing accuracy versus edge rounds, to depict the training performance of our proposed method in various settings. The edge testing accuracy increases in a jagged manner due to statistical heterogeneity [13]. Specifically, as shown in (a,b), in Traditional FL all clients interact directly with a single server regardless of communication costs, which represents the ideal performance upper bound. HFL/R-HFL shows the results without attacks, illustrating that our proposed PCS detection mechanism does not affect performance in the attack-free case. Comparing R-HFL under attacks with HFL under attacks, we find that the proposed R-HFL effectively alleviates the negative impact of attacks (taking ANMP as an example) compared with the original HFL algorithm. Moreover, as shown in (c,d), we investigate the R-HFL global model performance under different attack intensities (taking ANMP and SFMP as examples). As $prob$ increases, the performance in each communication round degrades uniformly, showing that attack intensity unavoidably influences FL training. Additionally, combining (a) and (c) (or (b) and (d)), we find the degradation is much more limited in R-HFL than in the original HFL, owing to the detection mechanism.

6.2.2. Effect of the Key Factor: Threshold t

Figure 3 depicts the effect of the key factor, the threshold $t$, on R-HFL performance in the two model-poisoning scenarios. As shown in (a,b), there exists an appropriate setting of $t$ for the proposed R-HFL to reach the optimal performance under the ANMP attack. Specifically, when $t = 0$, the performance of R-HFL is only slightly better than that of the original HFL under attack, meaning that the PCS detectors miss many adverse models because of the relaxed threshold. When $t = 0.5$, the performance is close to that of the original HFL without attack, meaning that the detectors successfully filter all adverse models, although the performance degrades slightly due to the reduced number of normal participants [16,36]. When $t = 0.95$, the performance degrades further, meaning that the detectors filter not only all adverse models but also quite a few normal local models with high data heterogeneity, especially in early training rounds. In (c,d) (under the SFMP attack), we reach similar conclusions, but notice that when $t = 0$ the performance of R-HFL is unchanged from the case of $t = 0.5$. This indicates that SFMP attacks cause larger model changes than ANMP attacks to some extent (on the MNIST dataset, with standard normal noise), since an apparent model difference allows a more relaxed feasible threshold $t$.
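A quick numerical check (not from the paper, reusing cosine_similarity and NumPy from the earlier sketches) illustrates why SFMP is easier to detect than ANMP: a sign-flipped upload always has cosine similarity -1 with the reference model, whereas the similarity of a noisy upload depends on the noise-to-weight ratio.

rng = np.random.default_rng(1)
x_ref = 0.1 * rng.normal(size=1000)              # reference edge model; the 0.1 scale is an
                                                 # assumption mimicking small network weights
noisy = x_ref + rng.normal(0.0, 1.0, size=1000)  # ANMP-style upload
print(cosine_similarity(-x_ref, x_ref))          # SFMP: -1 (up to numerical precision)
print(cosine_similarity(noisy, x_ref))           # ANMP: roughly 0.1 here, rejected once t exceeds it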
Figure 4 depicts the effect of the threshold $t$ in the data-poisoning scenario. As shown in (a,b), when $t \le 0.5$ the performance of R-HFL degenerates to that of the original HFL under the ANDD attack, showing that the model difference is small in this scenario; hence the lower limit of the feasible detection threshold is higher than for the previous two attacks. Moreover, when $t = 0.65$, the performance of R-HFL is even worse than that of the original HFL under attack, indicating that much adverse information is missed in the filtering process and, worse, that the gain from filtering out a small fraction of adverse models is offset by the reduction in the number of normal participating clients. The improvement occurs at $t = 0.8$: as the threshold $t$ becomes large, malicious information is dropped entirely. Although the reduction of participating clients slows the convergence rate, the R-HFL model ends up with a much better global performance than the poisoned model.
Additionally, Table 1 summarizes the testing accuracy under all scenarios in different settings, where the results show that the proposed R-HFL method has significant advantages over the baselines. Furthermore, we provide the corresponding detection accuracy in brackets, which indicates that the test accuracy is positively correlated with the detection accuracy.

7. Conclusions

In this paper, a new hierarchical cloud–edge–end collaboration-based FL method called R-HFL was proposed. Its stabilization and convergence properties were analysed and discussed, and relevant experiments were studied elaborately. Specifically, the hierarchical architecture is adopted in the R-HFL framework to explicitly reduce the required number of cloud communication rounds from $KE$ to $K$ during training. Moreover, a distributed anomaly detection technique, PCS, was proposed and applied in R-HFL to stabilize training and improve performance under poisoning attacks. The experimental results agree with the theoretical analyses, convincingly demonstrating the validity of the proposed R-HFL. Future work can investigate how to extend these results to effectively enhance the Byzantine robustness of multilayered FL systems with high global information diffusion delays and, more generally, how to leverage global information to mitigate the local detection bias caused by data heterogeneity for better stabilization and convergence.

Author Contributions

Y.Z.: investigation; methodology; writing—original draft. R.W.: conceptualization; investigation; methodology; validation; visualization. X.M.: writing—original draft. Z.L.: writing—review and editing; supervision. T.T.: conceptualization; data curation; project administration; resources; supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grants 61901078, 62271096, 61871062, and U20A20157, and in part by the China University Industry-University-Research Collaborative Innovation Fund (Future Network Innovation Research and Application Project) under grant 2021FNA04008, and in part by the China Postdoctoral Science Foundation under grant 2022MD713692, the Chongqing Postdoctoral Science Special Foundation under grant 2021XM2018, the Natural Science Foundation of Chongqing under grant cstc2020jcyj-zdxmX0024, University Innovation Research Group of Chongqing under grant CXQT20017, and the Youth Innovation Group Support Program of ICE Discipline of CQUPT under grant SCIE-QN-2022-04.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the reviewers for their thorough and constructive comments that have helped improve the quality of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ML: Machine Learning
FL: Federated Learning
HFL: Hierarchical Federated Learning
R-HFL: Robust Hierarchical Federated Learning
SGD: Stochastic Gradient Descent
PCS: Partial Cosine Similarity
ANMP: Additive Noise of Model Parameter
SFMP: Sign-Flipping of Model Parameter
ANDD: Additive Noise of Data Distribution
MLP: Multilayer Perceptron

Appendix A. Proof of Theorem 1

Directly processing the smoothness condition, we have
$$\|\nabla f_m(x_m^{k,e+1}) - \nabla f_m(x_m^{k,e})\|^2 \le L^2 \|x_m^{k,e+1} - x_m^{k,e}\|^2 = L^2\big(\|x_m^{k,e+1}\|^2 + \|x_m^{k,e}\|^2 - 2\langle x_m^{k,e+1}, x_m^{k,e}\rangle\big) = L^2\Big(\|x_m^{k,e+1}\|^2 + \|x_m^{k,e}\|^2 - 2\Big\langle \frac{1}{|\mathcal{D}_m(\mathcal{N}_m)|}\sum_{i\in\mathcal{D}_m(\mathcal{N}_m)} x_{m,i}^{k,e},\, x_m^{k,e}\Big\rangle\Big) \le L^2\Big(\|x_m^{k,e+1}\|^2 + \|x_m^{k,e}\|^2 - 2t\,\frac{1}{|\mathcal{D}_m(\mathcal{N}_m)|}\sum_{i\in\mathcal{D}_m(\mathcal{N}_m)} \|x_{m,i}^{k,e}\|\,\|x_m^{k,e}\|\Big) \le L^2\big(\|x_m^{k,e+1}\|^2 + \|x_m^{k,e}\|^2 - 2t\,\|x_m^{k,e+1}\|\,\|x_m^{k,e}\|\big).$$
Rearranging the right-hand side as $L^2\big(\|x_m^{k,e+1}\| - \|x_m^{k,e}\|\big)^2 - 2(t-1)L^2\|x_m^{k,e+1}\|\,\|x_m^{k,e}\|$, we complete the proof. □

Appendix B. Proof of Theorem 2

Combining Lemma 5 of [37] and Theorem 2 of [16] yields the result. □

Appendix C. Proof of Corollary 1

Note that $\nabla f(x^k) = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} \nabla f_i(x^k)$; then, in any cloud aggregation round $k$, the following holds:
$$\nabla\Big(f - \frac{|\mathcal{N}|}{\sum_{i \in \mathcal{N}} (1 - p_i)}\, \tilde{f}\Big)(x^k) = \sum_{i \in \mathcal{N}} \Big(\frac{1}{|\mathcal{N}|} - \frac{1 - p_i}{\sum_{i \in \mathcal{N}} (1 - p_i)}\Big) \nabla f_i(x^k) = \sum_{i \in \mathcal{N}} \Big(\frac{1}{|\mathcal{N}|} - \frac{1 - p_i}{\sum_{i \in \mathcal{N}} (1 - p_i)}\Big) \nabla (f_i - f)(x^k).$$
Taking squared norms on both sides and applying the Cauchy-Schwarz inequality, one can obtain
$$\Big\|\nabla\Big(f - \frac{|\mathcal{N}|}{\sum_{i \in \mathcal{N}} (1 - p_i)}\, \tilde{f}\Big)(x^k)\Big\|^2 \le \sum_{i \in \mathcal{N}} \Big(\frac{1}{|\mathcal{N}|} - \frac{1 - p_i}{\sum_{i \in \mathcal{N}} (1 - p_i)}\Big)^2 \sum_{i \in \mathcal{N}} \|\nabla (f_i - f)(x^k)\|^2 = \sum_{i \in \mathcal{N}} \Big(\frac{1}{|\mathcal{N}|} - \frac{1 - p_i}{\sum_{i \in \mathcal{N}} (1 - p_i)}\Big)^2 \sum_{i \in \mathcal{N}} \|\nabla (f_i - f_m|_{i \in \mathcal{N}_m} + f_m|_{i \in \mathcal{N}_m} - f)(x^k)\|^2 \le \sum_{i \in \mathcal{N}} \Big(\frac{1}{|\mathcal{N}|} - \frac{1 - p_i}{\sum_{i \in \mathcal{N}} (1 - p_i)}\Big)^2 \sum_{i \in \mathcal{N}} (\delta_i^m + \Delta_m)^2\big|_{i \in \mathcal{N}_m} := C_{p,\delta,\Delta}.$$
Observe the fact that
$$\|\nabla f(x^k)\|^2 \le 2 \Big\|\nabla\Big(f - \frac{|\mathcal{N}|}{\sum_{i \in \mathcal{N}} (1 - p_i)}\, \tilde{f}\Big)(x^k)\Big\|^2 + 2 \Big\|\frac{|\mathcal{N}|}{\sum_{i \in \mathcal{N}} (1 - p_i)} \nabla \tilde{f}(x^k)\Big\|^2 \le 2 C_{p,\delta,\Delta} + \frac{2 |\mathcal{N}|^2}{\big(\sum_{i \in \mathcal{N}} (1 - p_i)\big)^2} \|\nabla \tilde{f}(x^k)\|^2.$$
Finally, processing the following inequality,
$$\lim_{K \to \infty} \min_{k \le K} \|\nabla f(x^k)\|^2 \le \lim_{K \to \infty} \frac{1}{\sum_{k=0}^{K} \eta(k)} \sum_{k=0}^{K} \eta(k) \|\nabla f(x^k)\|^2 \le 2 C_{p,\delta,\Delta} + \frac{2 |\mathcal{N}|^2}{\big(\sum_{i \in \mathcal{N}} (1 - p_i)\big)^2} \lim_{K \to \infty} \frac{1}{\sum_{k=0}^{K} \eta(k)} \sum_{k=0}^{K} \eta(k) \|\nabla \tilde{f}(x^k)\|^2,$$
we obtain the claim by applying Theorem 2. □

References

  1. Wang, X.; Han, Y.; Wang, C.; Zhao, Q.; Chen, X.; Chen, M. In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Netw. 2019, 33, 156–165.
  2. Zhou, Z.; Chen, X.; Li, E.; Zeng, L.; Luo, K.; Zhang, J. Edge intelligence: Paving the last mile of artificial intelligence with edge computing. Proc. IEEE 2019, 107, 1738–1762.
  3. Fan, B.; Wu, Y.; He, Z.; Chen, Y.; Quek, T.Q.; Xu, C.Z. Digital Twin Empowered Mobile Edge Computing for Intelligent Vehicular Lane-Changing. IEEE Netw. 2021, 35, 194–201.
  4. Wang, Z.; Gao, Y.; Fang, C.; Liu, L.; Zeng, D.; Dong, M. State-Estimation-Based Control Strategy Design for Connected Cruise Control With Delays. IEEE Syst. J. 2022, 1–12.
  5. Tang, T.; Li, L.; Wu, X.; Chen, R.; Li, H.; Lu, G.; Cheng, L. TSA-SCC: Text Semantic-Aware Screen Content Coding With Ultra Low Bitrate. IEEE Trans. Image Process. 2022, 31, 2463–2477.
  6. Li, Z.; Zhu, N.; Wu, D.; Wang, H.; Wang, R. Energy-Efficient Mobile Edge Computing Under Delay Constraints. IEEE Trans. Green Commun. Netw. 2022, 6, 776–786.
  7. Li, Z.; Gao, X.; Li, Q.; Guo, J.; Yang, B. Edge Caching Enhancement for Industrial Internet: A Recommendation-Aided Approach. IEEE Internet Things J. 2022, 9, 16941–16952.
  8. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
  9. Konečnỳ, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527.
  10. Luo, B.; Xiao, W.; Wang, S.; Huang, J.; Tassiulas, L. Tackling System and Statistical Heterogeneity for Federated Learning with Adaptive Client Sampling. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications, London, UK, 2–5 May 2022; pp. 1739–1748.
  11. Yang, H.H.; Liu, Z.; Quek, T.Q.; Poor, H.V. Scheduling policies for federated learning in wireless networks. IEEE Trans. Commun. 2019, 68, 317–333.
  12. Xue, Q.; Liu, Y.J.; Sun, Y.; Wang, J.; Yan, L.; Feng, G.; Ma, S. Beam Management in Ultra-dense mmWave Network via Federated Reinforcement Learning: An Intelligent and Secure Approach. IEEE Trans. Cogn. Commun. Netw. 2022, 1.
  13. Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 2019, 37, 1205–1221.
  14. Li, Z.; Zhou, Y.; Wu, D.; Tang, T.; Wang, R. Fairness-Aware Federated Learning With Unreliable Links in Resource-Constrained Internet of Things. IEEE Internet Things J. 2022, 9, 17359–17371.
  15. Tran, N.H.; Bao, W.; Zomaya, A.; Nguyen, M.N.; Hong, C.S. Federated learning over wireless networks: Optimization model design and analysis. In Proceedings of the IEEE INFOCOM 2019-IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 1387–1395.
  16. Liu, L.; Zhang, J.; Song, S.; Letaief, K.B. Client-Edge-Cloud Hierarchical Federated Learning. In Proceedings of the ICC 2020-2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–6.
  17. Mothukuri, V.; Parizi, R.M.; Pouriyeh, S.; Huang, Y.; Dehghantanha, A.; Srivastava, G. A survey on security and privacy of federated learning. Future Gener. Comput. Syst. 2021, 115, 619–640.
  18. Yu, H.; Yang, S.; Zhu, S. Parallel restarted SGD for non-convex optimization with faster convergence and less communication. arXiv 2018, arXiv:1807.06629.
  19. Stich, S.U. Local SGD Converges Fast and Communicates Little. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  20. Wang, J.; Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv 2018, arXiv:1808.07576.
  21. Stich, S.U.; Karimireddy, S.P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv 2019, arXiv:1909.05350.
  22. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
  23. Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.J.; Zhang, W.; Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5336–5346.
  24. Li, X.; Yang, W.; Wang, S.; Zhang, Z. Communication efficient decentralized training with multiple local updates. arXiv 2019, arXiv:1910.09126.
  25. Yu, H.; Jin, R.; Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 7184–7193.
  26. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 7611–7623.
  27. Nguyen, H.T.; Sehwag, V.; Hosseinalipour, S.; Brinton, C.G.; Chiang, M.; Poor, H.V. Fast-convergent federated learning. IEEE J. Sel. Areas Commun. 2020, 39, 201–218.
  28. Chen, M.; Yang, Z.; Saad, W.; Yin, C.; Poor, H.V.; Cui, S. Performance optimization of federated learning over wireless networks. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Big Island, HI, USA, 9–13 December 2019; pp. 1–6.
  29. Luo, S.; Chen, X.; Wu, Q.; Zhou, Z.; Yu, S. HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning. IEEE Trans. Wirel. Commun. 2020, 19, 6535–6548.
  30. Mhaisen, N.; Abdellatif, A.A.; Mohamed, A.; Erbad, A.; Guizani, M. Optimal User-Edge Assignment in Hierarchical Federated Learning Based on Statistical Properties and Network Topology Constraints. IEEE Trans. Netw. Sci. Eng. 2022, 9, 55–66.
  31. Li, S.; Cheng, Y.; Liu, Y.; Wang, W.; Chen, T. Abnormal client behavior detection in federated learning. arXiv 2019, arXiv:1910.09933.
  32. Blanchard, P.; El Mhamdi, E.M.; Guerraoui, R.; Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. Adv. Neural Inf. Process. Syst. 2017, 30.
  33. Shen, S.; Tople, S.; Saxena, P. Auror: Defending against Poisoning Attacks in Collaborative Deep Learning Systems. In Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC '16), Los Angeles, CA, USA, 5–9 December 2016; pp. 508–519.
  34. Fang, M.; Cao, X.; Jia, J.; Gong, N. Local model poisoning attacks to Byzantine-Robust federated learning. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20), Boston, MA, USA, 12–14 August 2020; pp. 1605–1622.
  35. Li, S.; Cheng, Y.; Wang, W.; Liu, Y.; Chen, T. Learning to detect malicious clients for robust federated learning. arXiv 2020, arXiv:2002.00211.
  36. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the Convergence of FedAvg on Non-IID Data. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  37. Salehi, M.; Hossain, E. Federated Learning in Unreliable and Resource-Constrained Cellular Wireless Networks. IEEE Trans. Commun. 2021, 69, 5136–5151.
Figure 1. R-HFL system with cloud–edge–end cooperation: cloud and edge servers cooperatively generate global FL models from participating end devices. Detectors are deployed at the edges to filter malicious clients in parallel, which are built by the PCS mechanism that compares the local model with the aggregated edge model.
Figure 2. Performance evaluation (i) with different FL frameworks as (a,b), and (ii) under different intensities of attacks as (c,d).
Figure 3. Effect of Threshold t for defending against (i) ANMP and (ii) SFMP attacks. (a) ANMP: global model performance varying with t, (b) ANMP: edge model performance varying with t, (c) SFMP: global model performance varying with t, (d) SFMP: edge model performance varying with t.
Figure 4. Effect of Threshold t for defending against ANDD attack. (a) ANDD: global model performance varying with t, (b) ANDD: edge model performance varying with t.
Table 1. Testing accuracy (%) and, in brackets, detection accuracy (%) varying with PCS threshold t in different scenarios.

Scenario | No detection | t = 0 | t = 0.5 | t = 0.65 | t = 0.8 | t = 0.95 | t = 1
without attacks (FL) | 89.11 (-) | - | - | - | - | - | -
without attacks (HFL/R-HFL) | 88.71 (-) | - | - | - | - | - | -
ANMP (HFL) | 40.55 (-) | - | - | - | - | - | -
ANMP (R-HFL) | - | 41.08 (71.04) | 86.42 (≈100) | - | - | 86.28 (98.24) | 20.61 (≈50)
SFMP (HFL) | 1.16 (-) | - | - | - | - | - | -
SFMP (R-HFL) | - | 86.41 (≈100) | 86.41 (≈100) | - | - | 86.16 (98.16) | 20.61 (≈50)
ANDD (HFL) | 84.69 (-) | - | - | - | - | - | -
ANDD (R-HFL) | - | 84.69 (≈50) | 84.70 (≈50) | 82.65 (83.48) | 86.63 (≈100) | 86.43 (98.64) | 20.61 (≈50)