Cluster-Based Secure Aggregation for Federated Learning

Abstract: In order to protect each node's local learning parameters from model inversion attacks, secure aggregation has become an essential technique for federated learning, so that the federated learning server learns only the combined result of all the local parameters. In this paper, we introduce a novel cluster-based secure aggregation model that effectively deals with dropout nodes while reducing communication and computation overheads. Specifically, we consider a federated learning environment with heterogeneous devices deployed across the country. The computing power of each node and the amount of training data can be heterogeneous; because of this, each node has a different local processing time, and the response time to the server also differs. To clearly determine the dropout nodes in this environment, our model clusters nodes with similar response times based on each node's local processing time and location and then performs the aggregation on a per-cluster basis. In addition, we propose a new practical additive sharing-based masking protocol to hide the actual local updates of nodes during the aggregation. The new masking protocol makes it easy to remove the shares of dropout nodes from the aggregation without using a (t, n) threshold scheme, and the updates of dropout nodes remain secure even if they are delivered to the server after the dropout shares have been revealed. In addition, our model provides mask verification for reliable aggregation: nodes can publicly verify the correctness and integrity of the masks received from others based on the discrete logarithm problem before the aggregation. As a result, the proposed aggregation model is robust to dropout nodes and ensures the privacy of local updates as long as at least three honest nodes are alive in each cluster. Since the masking process is performed on a cluster basis, our model effectively reduces the overhead of generating and sharing the masking values. For an average cluster size C and a total number of nodes N, the computation and communication cost of each node is O(C), the computation cost of the server is O(N), and the communication cost of the server is O(NC). We analyzed the security and efficiency of our protocol by simulating diverse dropout scenarios. The simulation results showed that our cluster-based secure aggregation achieves about 91% learning accuracy regardless of the dropout rate with four clusters of one hundred nodes.


Introduction
Federated learning [1], which performs machine learning on the entire body of training data distributed across multiple external nodes, has a great advantage in protecting user privacy for the data used in training. Each node's training data does not need to be collected or exposed; instead, each node performs the learning using its own training data and shares only the local training result. The federated learning server continuously interacts with the nodes' local learning parameters until the weight parameters converge. Here, secure aggregation is necessary to hide the real learning parameters of the nodes from the federated server and from the other nodes, because the local learning model can be reconstructed from the local parameters, which is known as a model inversion attack [2][3][4]. Secure aggregation allows the federated server to obtain the total sum and average of all the local parameters without knowing the actual local parameters. In our approach, the local parameters are masked before the aggregation, and only the sum of the local parameters can be computed. The proposed aggregation is also robust to dropout users. To deal with dropout nodes, Shamir's (t, n) threshold secret sharing scheme is commonly used to reconstruct the masks of dropout users. However, the computation and communication overhead caused by this scheme is very large, and a verifiable secret sharing scheme may additionally be required to verify the shares. The proposed mask-based aggregation does not use a (t, n) threshold secret sharing scheme to remove the random masks of dropout users, so all the overheads caused by the secret sharing scheme are eliminated. In Section 3, we describe in detail how our method generates a correct aggregated result for the surviving users.
The main contributions of this paper are summarized as follows:
• We propose a new cluster-based secure aggregation (CSA) strategy for federated learning with heterogeneous nodes that have different computing powers and different sizes of training data.
• The CSA technique clusters nodes with similar response times. We introduce a processing score to represent the processing time of each node and propose a new grid-based clustering algorithm that clusters nodes by processing score and GPS information. Consequently, since the server can determine a reasonable latency for each cluster, the CSA technique improves the overall throughput of federated learning while reducing false-positive dropouts.
• We propose a novel additive sharing-based masking scheme that is robust to dropout nodes without using a (t, n) threshold secret sharing scheme. In particular, it allows nodes to verify the integrity and correctness of the masks received from other nodes for reliable aggregation. It also keeps local parameters private if each cluster has at least three honest nodes (i.e., nodes that do not collude with the server).
We briefly review the related work in Section 2 and give a concrete description of the proposed model in Section 3. The security of the proposed model is analyzed in Section 4, and the simulated performance is analyzed in Section 5. Finally, we conclude the paper in Section 6.

Related Work
The main goal of secure aggregation is to protect the privacy of each user's private data from the server's aggregation in a setting with a central server and multiple mobile nodes. A classic approach for secure aggregation is to use secure multiparty computation (MPC) protocols such as Yao's garbled circuits [5], secret sharing [6], and homomorphic encryption [7][8][9][10]. Protocols based on garbled circuits are suitable for a small number of parties, whereas protocols based on secret sharing or homomorphic encryption can be applied to hundreds of users [14]. Secret sharing has been actively used in recently proposed secure aggregation protocols since it can make the aggregation robust to dropout users. However, secret-sharing-based protocols have the drawback of high communication costs, because each node creates n shares and distributes them to the other nodes. Homomorphic encryption allows the aggregation to be performed on encrypted data in a public key infrastructure. However, its main weakness is that it is computationally expensive and needs an additional trusted party. Protocols [7,8] based on threshold additively homomorphic cryptosystems can handle dropout users but require additional trust assumptions. Pairing-based schemes [10] also require a trusted dealer to set up the keys. The protocol outlined in [9] computes the sum by sequentially performing one round of interactions between the server and each client, but it does not deal with dropout users.
In order to prevent the information leakage caused by analyzing the differences between parameters uploaded from clients, the concept of differential privacy (DP) has been proposed [11,12]. K. Wei et al. [12] added artificial noise to the parameters at the client side before aggregating. However, there is a tradeoff between convergence performance and the privacy protection level; that is, better convergence performance leads to a lower protection level. Another approach, which is closely related to our solution, is to use pairwise masks to hide the local parameters from the server. This pairwise blinding approach has been previously proposed in the studies in [15][16][17][18], which suggest different ways of dealing with client failure. K. Bonawitz et al. [13,14] proposed a notable additive masking-based secure aggregation for federated learning. Users hide their local updates using paired perturbations, which cancel in the aggregation, so the server obtains only the correct sum of all the local updates. In an improved study [14], they solved the tolerance problem for dropout users using Shamir's secret sharing scheme. Our proposed scheme is also based on pairwise masking by additive secret sharing, but the way of dealing with dropout users is different: we do not exploit Shamir's secret sharing. In addition, our scheme clusters mobile nodes by processing time and performs cluster-based aggregation for computation and communication efficiency.
In order to reduce the overheads of secure aggregation, J. So et al. [19] proposed turbo aggregation, which performs a circular aggregation over multiple groups. Group-based aggregation can reduce communication overheads because the sharing of masks and other data for secure aggregation is restricted to group members rather than to all the other nodes. It is similar to our scheme in that the aggregation is carried out on a group basis, but turbo aggregation randomly partitions the groups and operates the group aggregation in a circular way. In contrast, the clusters in the proposed model are determined according to the processing time and location of each node, and the cluster aggregations are performed independently in parallel. A. R. Elkordy and A. S. Avestimehr [20] suggested another group-based aggregation method for federated learning that allows the use of heterogeneous quantization according to communication resources in order to improve communication performance. Networks are partitioned into groups, user updates are partitioned into segments, and different levels of quantization are applied to the segments during the aggregation. C. Hu et al. [21] proposed a cluster-based distributed aggregation model to resolve the bottleneck of centralized aggregation. The edge devices are grouped into clusters, and the aggregation is performed by edge devices selected as cluster heads; the results are then finally aggregated by the server. The authors proposed algorithms to find clusters that minimize the maximum aggregation overhead. This is similar to our method in terms of cluster-based aggregation. However, in our model the server is trusted, so the aggregation is performed only on the server to assure its reliability; we believe that additional verifiable aggregation techniques would be required to enable aggregation by edge devices. Thus, our model performs cluster-based aggregation to reduce the overhead, but the aggregation itself is performed only on the server. Another approach to reducing the overhead is gradient sparsification, which sends only k parameters of the local gradient to the server. S. Lu et al. [22] proposed a top-k sparsification method for secure aggregation that sends masked top-k parameters without exposing the coordinate information of the top-k parameters in a user model. However, there is a tradeoff between training accuracy and communication efficiency.
Studies dealing with Byzantine users in the learning model [23][24][25] and studies on verifiable aggregation models [26,27] have also been proposed. L. He et al. [23] proposed a Byzantine-robust secure gradient descent algorithm for a two-server model; it needs two honest and non-colluding servers, each of which carries out a secure two-party interactive protocol with the mobile nodes. Z. Zhang et al. [25] proposed a lightweight Byzantine-robust model with two servers, where each node's local result is uploaded to the two servers using a secret sharing method, achieving both local result protection and Byzantine robustness. On the other hand, J. So et al. [24] proposed a distance-based outlier detection approach for single-server federated learning, which calculates the pairwise distances between local updates by sharing the masked updates and the pairwise distances between the masked shares. This approach detects outliers and can work well if every node generates similar local updates. However, we do not think that outliers are necessarily Byzantine users, and it is very challenging to detect Byzantine users who manipulate their updates in masked-update aggregation. We did not consider Byzantine users in this paper and leave them for future work. Regarding verifiable aggregation, Z. Yang et al. [26] provided a verification method to validate the weighted-average aggregation result using a homomorphic hash function under the assumption that the federated learning server cannot be fully trusted. C. Hahn et al. [27] suggested a way to verify the correctness of local updates under a cross-device federated learning model. Our model basically assumes that all participants are honest-but-curious, so they do not manipulate their outputs; instead, our model provides a verification method to validate the correctness and integrity of the mask values shared between users. As mentioned above, verifying the correctness and validity of user local updates was not covered in this paper and will be addressed in future work.

A Cluster-Based Secure Aggregation Model
We now describe the proposed cluster-based secure aggregation (CSA) model in detail. First, the federated learning structure, system environment, and threat model assumed in this paper are briefly explained. We then present the security requirements, define the main functions constituting the CSA protocol, and provide the detailed protocols for the node clustering and secure aggregation algorithms.

Background and Configuration
In this section, we briefly present the federated learning architecture used in this paper and describe our assumptions, system configuration, and notations. The federated learning system consists of a single central federated learning server and N mobile users (or nodes). In the rest of the paper, FS and U denote the federated learning server and the set of nodes, respectively, and each node is denoted by u_i. The local dataset (training data) of each u_i is non-IID and unbalanced.
(1) Federated learning
The FS trains a global model w ∈ R^d with dimension d using the data stored on the mobile devices. The training minimizes a global objective function F(w):

F(w) = ∑_{i=1}^{N} (n_i/n)·F_i(w),

where N is the total number of mobile nodes, F_i is the local objective function of u_i, n_i is the private data size of u_i, and n = ∑_i n_i. The local objective function F_i(w) of u_i for the global model w is defined as:

F_i(w) = (1/n_i)·∑_{j∈P_i} f_j(w),

where P_i is the index set of u_i's local examples and f_j(w) = l(x_j, y_j; w) is the loss of the prediction on example (x_j, y_j) made with model parameters w [1].
For a fixed learning rate η, the FS trains the global model by iteratively performing distributed stochastic gradient descent (SGD) with the currently available mobile nodes. At iteration t, the server shares the current global algorithm state (e.g., the current model parameters) w^t with the mobile nodes. Each u_i then computes ∇F_i(w^t), the average gradient on its local data at the current model w^t, and generates its local update w_i^{t+1}:

w_i^{t+1} = w^t − η·∇F_i(w^t).

u_i iterates the local update multiple times before sending the update to the FS. The FS then aggregates these updates and updates the global model for the next iteration:

w^{t+1} = ∑_{i=1}^{N} (n_i/n)·w_i^{t+1},

since the loss gradient ∇F(w^t) can be rewritten as a weighted average across nodes: ∇F(w^t) = ∑_{i=1}^{N} (n_i/n)·∇F_i(w^t).
(2) Communication model
We basically assume that all communication goes through the FS. Thus, all messages between nodes are first forwarded to the FS, and the FS sends them back to the corresponding nodes. However, this can be directly extended to node-to-node communication depending on the communication environment.
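As a quick illustration of the update rules above, the following sketch runs several local SGD steps and the weighted aggregation. It is a minimal sketch assuming NumPy parameter vectors; local_step, aggregate, and grad_fn are hypothetical names standing in for u_i's local training and ∇F_i.

import numpy as np

def local_step(w, grad_fn, eta=0.01, epochs=5):
    """Several local SGD steps from the global model: w_i = w - eta * grad F_i(w)."""
    w_local = w.copy()
    for _ in range(epochs):
        w_local -= eta * grad_fn(w_local)
    return w_local

def aggregate(updates, data_sizes):
    """Weighted average of local updates: w = sum_i (n_i / n) * w_i."""
    n = float(sum(data_sizes))
    return sum((n_i / n) * w_i for w_i, n_i in zip(updates, data_sizes))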
(3) System parameters
The FS generates a large prime p, a public-key cryptosystem, and the server's private and public key pair, denoted <K_S^−, K_S^+>. PKE(K^+, M) represents public-key encryption of a message M with a public key K^+, and PKD(K^−, C) denotes public-key decryption of a ciphertext C with a private key K^−.
(4) Registration
Before participating in federated learning, each node registers with the FS. The node creates an ID and a pair of private and public keys and registers them with the FS. The FS then creates a symmetric key shared between the node and the FS, which is used for efficient data encryption between them. The shared key can be generated by the Diffie-Hellman key exchange protocol or randomly generated by the FS. In our model, the FS generates a random key, encrypts it with the node's public key, and sends it to the node. The FS securely manages the IDs, public keys, and shared keys.
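A minimal sketch of the registration step, assuming RSA-OAEP as the public-key cryptosystem and the pyca/cryptography package (the paper fixes neither); it shows the random-key variant in which the FS picks the shared key.

import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Node side: create a key pair; the public key is registered with the FS.
node_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
node_public = node_private.public_key()

# FS side: generate a random 256-bit shared key and encrypt it for the node.
shared_key = os.urandom(32)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = node_public.encrypt(shared_key, oaep)   # PKE(K^+, M)

# Node side: recover the shared key.
assert node_private.decrypt(ciphertext, oaep) == shared_key   # PKD(K^-, C)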
Table 1 summarizes the notations used in the rest of the paper.

Table 1. Notations.
w_i^t: The local update of u_i for w^t
G_{i,j}, G: The grid cell at row i and column j, where G is the set of all cells
U_{i,j}: The set of nodes mapped to the grid cell G_{i,j}
C_i, C: The i-th cluster, where C is the set of all clusters
r_i, R_i: The random integer that the FL server assigns to C_i, and R_i = g^{r_i} mod p
m_{j,k}, M_{j,k}: The random mask that u_j generates for u_k, and M_{j,k} = g^{m_{j,k}} mod p
S_i: The share (secure update) of u_i for the secure aggregation
IS_i: The intermediate sum of C_i

Problem Definition
The federated learning model protects the privacy of training data by letting the private training data remain on the user's mobile device; it requires only the local model parameters trained by each mobile device. However, it has recently been demonstrated that a server can still reconstruct the private data from the local models using a model inversion attack [2][3][4]. Thus, secure aggregation, which does not expose the local model parameters during the server's aggregation, is necessary for federated learning. The core of secure aggregation is that the federated learning server learns only the aggregated sum S = ∑_{i=1}^{N} w_i of the local model parameters w_i without learning any information about the individual w_i.

Another major issue to consider in secure aggregation is dropout users. Users who initially participate in federated learning may drop out during the learning due to network or device issues. Therefore, secure aggregation must tolerate dropout users; that is, the aggregation should be able to correctly reconstruct the aggregated sum of the currently available users even when dropouts occur. In addition, the dropout decision can be erroneous: a "dropout" user's response may not actually be dropped but merely delayed, so it may be delivered to the server after the aggregation. Even in this case, no one should be able to learn any information about the local model of the dropout user from the delayed response.

As mentioned earlier, it is not easy to distinguish between message drop and delay during each round of aggregation. Federated learning requires iterative aggregations, and the server cannot wait indefinitely for all the responses in every single aggregation. Therefore, the server must be able to determine a reasonable latency after which it terminates a single aggregation, and this latency can also be used to distinguish between message drop and delay. The proposed CSA solves this problem simply and practically. Ultimately, the communication latency to the server is mainly affected by the processing time of each device and the network conditions. The processing time of each device is determined by the size of the local dataset and the computational power required to train on it. On the other hand, it is not easy to measure the actual communication conditions between each mobile node and the server. Instead, we exploit two assumptions: (1) the processing delay of each device dominates over the communication delay, and (2) the communication delay is proportional to the distance to the server. Accordingly, rather than measuring the actual network state, the geographic location of each device is simply used to estimate its distance to the server. Based on this, the key strategy of the CSA technique is to first cluster the mobile devices according to their processing times and geographic locations and then perform the aggregation on a cluster basis. Since the processing time and communication distance of each node in a cluster can be estimated, the server can determine an appropriate latency for each cluster. In addition, the cluster-based strategy reduces the computation and communication overheads of the secure aggregation because each node needs additional computation and communication to share random masks only with members of the same cluster.
We consider an honest-but-curious threat model. All participants, including the FS and the mobile nodes, act honestly in the communications: they do not deviate from the defined protocol but attempt to learn all possible information from legitimately received messages. Under this threat model, the proposed aggregation satisfies the following security requirements:

• Privacy of local datasets and updates: All the data that each node holds on its local device and all the local learning parameters shared over the network must remain confidential, not only from the other nodes but also from the FS. The FS learns only the aggregated sum of the local updates provided by all the nodes. In addition, even if a particular user's update is delivered to the FS after the aggregation, the FS and the other users cannot reconstruct that user's local parameters from the delayed data.
• Tolerance to dropouts: User updates can be dropped during communication due to network conditions or device issues. The FS should be able to compute a correct aggregated sum over the currently active users even when dropouts occur.
• Integrity of random masks: Users create random masks and share them with other users to hide their actual local model parameters, and these masks must be correctly removed during the aggregation of the local updates. Therefore, users should be able to validate the correctness and integrity of the given masks; in other words, users can be sure that the masks are constructed so that they will necessarily cancel during the aggregation and that the masks were not modified in transit.
We now describe the proposed CSA protocol and its main functions. The CSA protocol consists of two main parts: node clustering and secure aggregation. The node clustering is performed once at the beginning of the federated learning; it clusters nodes according to their processing scores and GPS information. After the clustering, the FS repeats the secure aggregation with the nodes. The secure aggregation consists of the following key functions:
(1) Quantization: The CSA exploits additive secret sharing defined over a finite field for a prime p, so all operations in the CSA are carried out over Z_p^*. Since the local parameters are real numbers, the nodes need to quantize the real values to integers. To achieve this, we adopt So et al.'s stochastic quantization strategy [24].
(2) Masking with additive secret sharing: The FS selects a random nonce for each cluster. Then, each node generates random masks for the other nodes in the same cluster and shares the encrypted masks with them. The random masks are created by an additive secret sharing method based on the cluster's random nonce. The nodes then create their updates masked with those shares.
(3) Aggregation: The updates of the nodes are first aggregated on a cluster basis. When dropout users occur in a cluster, the currently available nodes in the cluster perform a recovery phase: they fix the cluster sum by removing the masks of the dropout users from the aggregated sum. After the recovery phase, the cluster sums are finally aggregated.
The detailed protocol for each function is described in the following sections.

Node Clustering
In this section, we describe the node clustering in detail. We first define a processing score representing the total processing time of each node: based on the size of the local dataset and the computation capacity CC of the device, the processing score PS is defined as per Equation (5). Let u_i's processing score be PS_i. PS_i is then mapped to one of K discrete PS levels as per Equation (6), where K is the number of PS levels.
Using the PS level, the FS generates a grid G for the nodes. For the entire area containing all the nodes, the FS partitions the area into a × b cells for predefined parameters a and b. G_{i,j} denotes the cell at the i-th row and j-th column. The FS and the nodes are mapped to the grid cells based on their GPS information, and U_{i,j} denotes the set of nodes mapped to G_{i,j}. Figure 1 shows an example of a node distribution over a 6 × 5 grid, where the number in each circle indicates the PS level of the node.
The FS creates K clusters that are initially empty. C_i denotes the cluster of level i for 1 ≤ i ≤ K. The FS proceeds with the node clustering by expanding the range of the neighborhood around the server. Suppose that the cell where the FS is located is G_{rf,cf}. The initial neighborhood is G_{rf,cf} itself. Each node in the neighborhood is allocated to the cluster of the same level as its own PS level; that is, if the PS level of a node is l, the node is assigned to C_l. After the first round of clustering is complete, the next neighborhood consists of all the cells surrounding G_{rf,cf} with an index difference of 1: G_{rf−1,cf−1}, G_{rf−1,cf}, G_{rf−1,cf+1}, G_{rf,cf−1}, G_{rf,cf+1}, G_{rf+1,cf−1}, G_{rf+1,cf}, and G_{rf+1,cf+1}. In this way, the neighborhood is expanded by increasing the index difference by 1, and the node clustering is repeated for all the nodes in the new neighborhood. The blue dotted line in Figure 1 represents the neighborhood in each round. As shown in the figure, u_1, u_2, and u_3 are clustered in the first round according to their PS levels, and u_4 is clustered in the second round. The rest of the nodes are clustered in the same way.
For each cluster, the nodes are hierarchically organized in the order in which they were clustered. Although all the nodes belonging to the same cluster have the same PS level, their distances to the FS differ, so this hierarchical organization is required to classify the distance to the FS within the cluster. The detailed node clustering algorithm is shown in Algorithm 1 below.

Algorithm 1: Node Clustering
Input: G: a grid of nodes; mr: the maximum row index; mc: the maximum column index; rf: the row index of the server cell; cf: the column index of the server cell; U = {u_1, ..., u_N} with each u_i's GPS information and PS level.
Output: K clusters C_1, ..., C_K
create K empty clusters C_1, ..., C_K, where each C_i is hierarchically structured with an initial bucket B = ∅;
for each node u in U: generate U_{i,j} using u's GPS information and G;
for (D = 0; (rf + D <= mr || cf + D <= mc); D++)
    for each cell G_{i,j} with max(|i − rf|, |j − cf|) = D
        for each node u in U_{i,j}: append u to the bucket of depth D in C_l, where l is u's PS level;

Figure 2 shows the results after clustering all the nodes in Figure 1. Since the nodes of each cluster are hierarchically structured in the order in which they were clustered, the nodes of C_1 have a three-tiered structure, and the nodes of C_2 have a two-tiered structure. On the other hand, C_3, C_4, and C_5 do not satisfy the minimum cluster size. Therefore, an additional cluster rebalancing process is required so that all the clusters meet the minimum cluster size requirement.
Once the clustering is complete, the FS forms the final clusters by merging the clusters whose size is smaller than δ. The merge begins from cluster level K and repeats the following process: if C_K does not satisfy δ and C_{K−1} is sufficiently large, some nodes of C_{K−1} (those farthest from the FS) are moved into C_K; otherwise, the nodes of C_K are merged into C_{K−1}, and C_K is discarded. The process then continues with the next lower level until every remaining cluster satisfies δ.

Figure 3 shows the final results after the cluster merging. First, |C_5| is 1 and |C_4| is 1, so u_11 is merged into C_4, and C_5 is discarded. Next, |C_4| is still 2 and less than 4, and |C_3| is also 1, so u_9 and u_11 in C_4 are merged into C_3, and C_4 is discarded too. Now |C_3| is 3, but it lacks one node to meet the requirement. |C_2| is sufficiently large even if one node is moved to C_3, so u_13, the node with the longest distance to the FS, is merged into C_3. Since both C_2 and C_1 meet the requirement, the merge process terminates. Consequently, C_1, C_2, and C_3 are finally created: C_1 contains <u_1, u_3, u_5, u_10, u_7>, C_2 contains <u_2, u_4, u_6, u_12>, and C_3 contains <u_8, u_9, u_11, u_13>.

After finishing the node clustering, the FS determines an appropriate latency (or latency level) for the dropout decision for each cluster. This can be set, for example, to three times the shortest response time in the cluster. The FS then sends the cluster ID and the list of the nodes in the cluster to each node. The node list contains the node ID and the public key of each node belonging to the same cluster.
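To make the procedure concrete, the following Python sketch maps nodes to grid cells, visits the cells ring by ring around the server cell as in Algorithm 1, and assigns each node to the cluster of its PS level. It is a sketch under stated assumptions: the exact forms of Equations (5) and (6) are paraphrased (ps_level simply buckets scores evenly into K levels), and all names are our own.

import math
from collections import defaultdict

def ps_level(ps, ps_max, K):
    # Assumed reading of Equation (6): split [0, ps_max] evenly into K levels.
    return min(K, max(1, math.ceil(ps / ps_max * K)))

def cluster_nodes(nodes, server_cell, K, ps_max):
    """nodes: list of dicts {'id', 'cell': (row, col), 'ps'}."""
    cells = defaultdict(list)                      # U_{i,j}
    for u in nodes:
        cells[u['cell']].append(u)
    clusters = {l: [] for l in range(1, K + 1)}    # C_1..C_K; list order = hierarchy
    rf, cf = server_cell
    max_d = max(max(abs(r - rf), abs(c - cf)) for r, c in cells)
    for d in range(max_d + 1):                     # expand the neighborhood ring by ring
        for (r, c), members in cells.items():
            if max(abs(r - rf), abs(c - cf)) == d:
                for u in members:
                    clusters[ps_level(u['ps'], ps_max, K)].append(u['id'])
    return clusters

The subsequent merging step (moving the farthest nodes of a sufficiently large C_{l−1} into a too-small C_l, or discarding C_l into C_{l−1}) operates on the returned lists, whose order already encodes the distance hierarchy.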

BCSA: A Basic Cluster-Based Secure Aggregation Model
Next, we describe the basic cluster-based secure aggregation protocol, denoted BCSA in the rest of the paper, in detail. The aggregation of the nodes' local updates is conducted on a cluster basis, and the FS eventually aggregates all the intermediate sums of the clusters. BCSA is defined by the following steps:
Step 1: Assignment of training weights and random nonces
Step 1 is performed only once, before the aggregation begins. The FS computes training weights for the nodes and chooses random nonces for the clusters. Let C_i be the i-th cluster and |C_i| be the size of C_i, i.e., the number of nodes belonging to C_i. Since each node has a different training data size, the FS assigns training weights to the nodes according to their training data sizes. Let λ_j be the training weight of u_j in C_i; λ_j is determined as:

λ_j = n_j / n,

where n_j is the data size of u_j and n is the total data size of all the nodes.
The FS also chooses a random integer r_i mod p for C_i, which is later used to generate the random masks at each node. The FS delivers r_i along with λ_j to each u_j in C_i, encrypted with the public key of u_j. The FS also publishes R_i = g^{r_i} mod p to all the nodes so that every node can verify the validity of r_i.
After Step 1 is finished, the following steps are repeated until the federated learning ends.
Step 2: Quantization
The local parameters w_j of each node are converted to integers by the stochastic rounding function proposed in [24]. For any integer q ≥ 1, the stochastic rounding function is:

Q_q(x) = ⌊qx⌋/q with probability 1 − (qx − ⌊qx⌋), and (⌊qx⌋ + 1)/q with probability qx − ⌊qx⌋,

where ⌊x⌋ is the largest integer less than or equal to x and q is the number of quantization levels. For a mapping function φ : R → F_p that maps negative values into the upper half of the field (φ(x) = x if x ≥ 0 and p + x if x < 0), the quantized model is defined as:

W̄_j = φ(q·Q_q(w_j)).

Step 3: Random mask generation and distribution
Each u_j in C_i generates random masks for all the other nodes in C_i. Let m_{j,k} be the random mask for u_k, where j, k ∈ {1, ..., |C_i|}. u_j chooses the masks m_{j,k} so that they satisfy Equation (10):

∑_{k≠j} m_{j,k} ≡ r_i mod (p − 1).

To achieve this, u_j first chooses l − 1 random positive integers m_1, ..., m_{l−1} modulo p − 1, where l = |C_i| − 1, and then determines the final random integer as m_l = r_i − ∑_{t=1}^{l−1} m_t. Here, m_l is not yet a residue modulo p − 1 and can be negative; thus, m_l is represented in the form x·(p − 1) + r for an integer x and a positive residue r, and u_j sets m_l := r ≡ m_l mod (p − 1). u_j also generates a public mask M_{j,k} for verifying the validity of m_{j,k} (Equation (11)):

M_{j,k} = g^{m_{j,k}} mod p,

where m_{j,k} = m_{j,k} mod (p − 1). u_j encrypts each m_{j,k} with u_k's public key and publishes all the encrypted masks and public masks. The FS sends them back to the nodes in C_i. If m_{k,j} < 0, u_j sets m_{k,j} = m_{k,j} mod (p − 1). Finally, u_j accepts m_{k,j} if Equation (12) holds:

g^{m_{k,j}} ≡ M_{k,j} mod p,

and the correctness of u_k's mask set can be publicly checked by testing ∏_{k'≠k} M_{k,k'} ≡ R_i mod p. If any mask is invalid or dropped during the communication, Step 3 is repeated until all the nodes share valid masks.
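The sketch below illustrates Steps 2 and 3 with toy parameters: stochastic rounding into Z_p, additive masks that sum to the cluster nonce r_i modulo p − 1 (Equation (10)), and the public-mask check against R_i. The prime, generator, and function names are our own assumptions; a deployment would use a large prime and the encrypted channels of the registration step.

import random

P, G = 2_147_483_647, 5          # toy prime p and generator g (assumptions)

def quantize(x, q, p=P):
    """Stochastic rounding Q_q followed by the signed-to-field map phi."""
    lo = int(q * x // 1)                                   # floor(q*x)
    v = lo + (1 if random.random() < (q * x - lo) else 0)  # integer q*Q_q(x)
    return v % p                                           # phi: negatives -> p + x

def make_masks(r_i, l, p=P):
    """l = |C_i| - 1 additive shares with sum = r_i mod (p - 1); Equation (10)."""
    m = [random.randrange(1, p - 1) for _ in range(l - 1)]
    m.append((r_i - sum(m)) % (p - 1))                     # final share m_l
    return m

def public_masks(masks, p=P, g=G):
    return [pow(g, m, p) for m in masks]                   # M_{j,k} = g^{m_{j,k}} mod p

def verify(pub, R_i, p=P):
    """Correctness check: the product of the public masks must equal R_i."""
    prod = 1
    for M in pub:
        prod = (prod * M) % p
    return prod == R_i

r_i = random.randrange(1, P - 1)
R_i = pow(G, r_i, P)
masks = make_masks(r_i, l=4)
assert verify(public_masks(masks), R_i)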
Step 4: Secure update generation
If all the masks are valid, u_j generates the secure update S_j for its quantized local weight W̄_j as follows and sends S_j to the FS (Equation (13)):

S_j = λ_j·W̄_j + ∑_{k≠j} m_{j,k} − ∑_{k≠j} m_{k,j} mod p.

Step 5: Cluster aggregation
For each cluster C_i, a different level of latency is determined for collecting the S_j. After collecting the S_j for C_i, the FS determines the list of currently available users. If all the nodes in C_i are available, the FS sends an "all available" message to the nodes and computes the intermediate sum IS_i of C_i as:

T_i = ∑_{u_j∈C_i} S_j mod p, IS_i = φ^{−1}(T_i)/q,

where φ^{−1}(·) is the dequantization function. Every mask m_{j,k} is added once (by u_j) and subtracted once (by u_k) in T_i, so the masks cancel, and the sum of the IS_i over all clusters yields the weighted average of the local weights w_j of all the nodes.
Step 5-1: Aggregation recovery: removing the masks of dropout users
For each cluster C_i, if there are any dropout users, the FS determines the list of the currently available users, denoted A_i, and sends it back to all the nodes. Each u_j in A_i replies with its signed confirmation message to the FS, and the FS broadcasts all the signatures to all the nodes. If all the signatures are valid, the currently available nodes carry out the recovery phase by removing the masks of the dropout users.
For example, suppose that u_1, ..., u_4 belong to a cluster C_1 and that u_2 and u_4 are dropout users; thus, S_2 and S_4 are dropped in Step 4. The sum of S_1 and S_3 is:

S_1 + S_3 = λ_1·W̄_1 + λ_3·W̄_3 + (m_{1,2} + m_{1,4} + m_{3,2} + m_{3,4}) − (m_{2,1} + m_{2,3} + m_{4,1} + m_{4,3}) mod p,

since the masks exchanged between u_1 and u_3 cancel each other. By the constraint of Equation (10), m_{2,1} + m_{2,3} = r − m_{2,4} and m_{4,1} + m_{4,3} = r − m_{4,2}, so recovering the sum of W̄_1 and W̄_3 additionally requires the values m_{1,2}, m_{1,4}, m_{2,4}, m_{3,2}, m_{3,4}, and m_{4,2}. Here, m_{1,2}, m_{1,4}, m_{3,2}, and m_{3,4} were created by u_1 and u_3, so those values can easily be obtained from u_1 and u_3. However, m_{2,4} and m_{4,2} were created by u_2 and u_4, respectively, and delivered to each other, so u_1 and u_3 cannot know those values. This can be solved without revealing m_{2,4} and m_{4,2} directly: for the dropouts u_2 and u_4, u_1 and u_3 compute their reconstruction values RS_1 and RS_3 using the masks they share with u_2 and u_4:

RS_1 = (m_{2,1} + m_{4,1}) − (m_{1,2} + m_{1,4}), RS_3 = (m_{2,3} + m_{4,3}) − (m_{3,2} + m_{3,4}).

Consequently, the new TS_1 correctly represents the weighted sum of W̄_1 and W̄_3:

TS_1 = S_1 + S_3 + RS_1 + RS_3 = λ_1·W̄_1 + λ_3·W̄_3 mod p.

We now formalize the above. The nodes in C_i can be divided into two groups: the active user group A_i and the dropout user group D_i. If D_i ≠ ∅, each node in A_i computes its reconstruction value RS_j using Equation (14) and sends it to the FS:

RS_j = ∑_{u_d∈D_i} (m_{d,j} − m_{j,d}) mod p. (14)

Then, the sum of the updates of the final active users is computed by Equation (15):

TS_i = ∑_{u_j∈A_i} (S_j + RS_j) = ∑_{u_j∈A_i} λ_j·W̄_j mod p. (15)

Here, the training weight of each active user was calculated considering the data of the dropout users as well, so it should be modified to a training weight over the data of only the active users. Thus, the final sum of the active users is determined by Equation (16), which rescales each weight λ_j = n_j/n to n_j/n_A, where n_A is the total data size of the active users:

IS_i = (n/n_A)·φ^{−1}(TS_i)/q. (16)

Step 6: Final aggregation
Lastly, the FS obtains the aggregate S over all nodes as S = ∑ IS_i over all the clusters, where |C| is the number of clusters and n_{A_i} = ∑_{u_j∈A_i} n_j. S is the weighted average of the local weights w_j of all the active nodes.
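The mask algebra of Steps 4 through 5-1 can be checked end to end with a toy run. The sketch below omits the training weights and quantization so that the cancellation of the pairwise masks and the role of the reconstruction values RS_j stay visible; the nonce constraint and its verification were shown in the previous sketch.

import random

P = 2_147_483_647
nodes = [1, 2, 3, 4]
w = {j: random.randrange(1000) for j in nodes}
m = {(j, k): random.randrange(1, P - 1)        # m[j, k]: mask u_j created for u_k
     for j in nodes for k in nodes if j != k}

def secure_update(j, p=P):
    """Equation (13), unweighted: S_j = w_j + sum_k m_{j,k} - sum_k m_{k,j}."""
    masks = sum(m[j, k] - m[k, j] for k in nodes if k != j)
    return (w[j] + masks) % p

S = {j: secure_update(j) for j in nodes}
dropouts, active = {2, 4}, [1, 3]
# Recovery (Equation (14)): each active u_j cancels the residual dropout masks.
RS = {j: sum(m[d, j] - m[j, d] for d in dropouts) % P for j in active}
total = sum(S[j] + RS[j] for j in active) % P
assert total == sum(w[j] for j in active) % P   # Equation (15): masks fully cancel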

FCSA: A Fully Secure Cluster-Based Aggregation Model
BCSA works correctly in the presence of dropout nodes. However, BCSA allows the FS to learn the actual local parameters of falsely detected dropout nodes: nodes that are determined to be dropouts but whose updates are delivered to the FS after the aggregation. For example, suppose that u_d was determined to be a dropout and that u_d's update S_d is delivered to the FS after the other active users have performed the recovery phase. In this case, the FS can learn the actual local parameter W̄_d of u_d simply by adding the reconstruction values of the active users to S_d. To remove this vulnerability, we propose a fully secure cluster-based aggregation protocol, denoted FCSA in the rest of the paper. FCSA uses an additional random secret chosen by each user to generate the secure update.
During the mask generation stage (Step 3 in Section 3.4), each u_j in C_i chooses another random value α_j. At the secure update generation stage (Step 4 in Section 3.4), u_j generates its secure update S_j as (Equation (17)):

S_j = λ_j·W̄_j + ∑_{k≠j} m_{j,k} − ∑_{k≠j} m_{k,j} + α_j mod p.

Unlike in BCSA, the recovery phase is now always necessary, regardless of dropouts, so that the α_j can be eliminated. If there are no dropouts, all the nodes send their α_j to the FS; otherwise, each active user u_j in A_i sends its reconstruction values for the dropout nodes along with α_j. Even if u_d's S_d is delivered to the FS after the recovery phase, the actual local parameter W̄_d remains secure because it is hidden by u_d's random secret α_d.
Additional dropouts can occur even in the recovery phase. Let such a node be u_x; that is, the S_x of u_x is passed to the FS normally, but the reconstruction value RS_x and secret α_x of u_x are dropped in the recovery phase. If u_x turns out to be a dropout, the final active users send back reconstruction values for all the dropout users including u_x, computed by Equation (14), and the FS can then compute the sum of the secure updates of the final active users by Equation (18):

∑_{u_j∈A_i} (S_j + RS_j − α_j) mod p = ∑_{u_j∈A_i} λ_j·W̄_j mod p. (18)

Even if the pair of RS_x and α_x of u_x is delivered to the FS late, u_x's local data W̄_x remain secure because S_x is still masked with the random values given by the other nodes.

We summarize the operation of FCSA with a simple example. For a cluster C_1 containing five nodes u_1, ..., u_5, suppose that u_2 and u_4 turn out to be dropouts in the initial cluster aggregation phase; thus, only S_1, S_3, and S_5 have been passed to the FS. Then, u_1, u_3, and u_5 perform the recovery phase and send their random secrets α_1, α_3, and α_5 and their reconstruction values RS_1, RS_3, and RS_5, generated by Equation (14), to the FS:

RS_1 = (m_{2,1} + m_{4,1}) − (m_{1,2} + m_{1,4}), RS_3 = (m_{2,3} + m_{4,3}) − (m_{3,2} + m_{3,4}), RS_5 = (m_{2,5} + m_{4,5}) − (m_{5,2} + m_{5,4}).

Now suppose that α_3 and RS_3 are not delivered to the FS normally, i.e., another dropout happens in the recovery phase. Then, the final active users u_1 and u_5 repeat the recovery phase by sending modified reconstruction values RS_1^(2) and RS_5^(2), computed over all the dropout users u_2, u_4, and u_3, to the FS.
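The same toy setting extends to FCSA's double masking: each node adds its private α_j (Equation (17)), and the recovery phase subtracts the revealed α_j (Equation (18)). Again a sketch with weights and quantization omitted.

import random

P = 2_147_483_647
nodes = [1, 2, 3, 4, 5]
w = {j: random.randrange(1000) for j in nodes}
alpha = {j: random.randrange(1, P - 1) for j in nodes}   # private secrets
m = {(j, k): random.randrange(1, P - 1) for j in nodes for k in nodes if j != k}

def fcsa_update(j, p=P):
    """Equation (17), unweighted: pairwise masks plus the private alpha_j."""
    masks = sum(m[j, k] - m[k, j] for k in nodes if k != j)
    return (w[j] + masks + alpha[j]) % p

S = {j: fcsa_update(j) for j in nodes}
dropouts, active = {2, 4}, [1, 3, 5]
# Recovery always runs: active nodes reveal alpha_j along with RS_j.
RS = {j: sum(m[d, j] - m[j, d] for d in dropouts) % P for j in active}
total = sum(S[j] + RS[j] - alpha[j] for j in active) % P
assert total == sum(w[j] for j in active) % P            # Equation (18)

# Even if S[2] now arrives late, it is still blinded by the unrevealed alpha[2],
# so the FS cannot recover w[2] from it.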

Security and Efficiency Analysis
In this section, we analyze the theoretical properties of FCSA in terms of its robustness to dropout nodes, the privacy of the local parameters, and the computational and communication efficiency of the aggregation. The results can be summarized as follows: (1) Robustness to dropouts: FCSA is robust against dropout users (there is no constraint on the number of active users). (2) Privacy of local parameters: FCSA guarantees the privacy of the local parameters of each node if there are at least three honest active users in each cluster and the cluster size is at least four. We first show that FCSA is robust against dropout users by showing that Equation (18) correctly derives the aggregated sum of the final active users.
For a cluster C and its random number r, let A be the group of final active users and D be the group of dropout users. Summing Equation (18) over A, the pairwise masks m_{j,k} and m_{k,j} exchanged between active users cancel each other, each RS_j removes exactly the residual masks ∑_{u_d∈D}(m_{j,d} − m_{d,j}) involving the dropout users, and each α_j is subtracted explicitly. Hence, ∑_{u_j∈A}(S_j + RS_j − α_j) = ∑_{u_j∈A} λ_j·W̄_j, which is the correct aggregated sum of the final active users regardless of |D|.
Next, we show that FCSA guarantees the privacy of each node's local parameters. We basically assume honest-but-curious nodes and FS: they do not manipulate or forge their data, but some of them may collude with the FS, and any colluding node can provide its masking values to the FS.

Theorem 1. For each cluster C satisfying |C| ≥ 4, if there are at least three honest (non-colluding with the FS) active users, FCSA guarantees the privacy of each node's local parameters.
Proof. Let the non-colluding active user group be H = {h_1, h_2, h_3} and the colluding user group be B = U − H. Let S_i be the secure update of u_i in A, defined as:

S_i = λ_i·W̄_i + ∑_{k≠i} m_{i,k} − ∑_{k≠i} m_{k,i} + α_i mod p.

All the nodes in A also provide RS_i and α_i, and any node in B can provide all the masking values it holds about the other nodes. Thus, by stripping every value they hold, the FS and the nodes in B can reduce S_i to:

S_i' = λ_i·W̄_i + ∑_{u_h∈H, h≠i} (m_{i,h} − m_{h,i}) mod p.

S_i' is still masked by the random masks generated with the honest nodes, since m_{i,h} and m_{h,i} are securely shared only between u_i and u_h. Suppose that the FS has published S_i' to all the nodes in U. We can consider two cases: (1) u_i belongs to B and u_h belongs to H, and (2) both u_i and u_h belong to H. In the first case, all the nodes in B except u_i cannot know any masking value m_{i,h} or m_{h,i}, because the honest nodes do not open their masking values. Any honest node u_h in H knows the m_{i,h} and m_{h,i} that it possesses, but it cannot know the masking values of the other honest nodes. Therefore, it is impossible to reveal the actual W̄_i from all the opened information.
For the second case, let the three honest nodes be h, h_1, and h_2. The reduced secure update of h is:

S_h' = λ_h·W̄_h + (m_{h,h_1} − m_{h_1,h}) + (m_{h,h_2} − m_{h_2,h}) mod p.

The nodes in B cannot know the masking values m_{h,h_j} shared only between the honest nodes, so S_h' is secure against all the nodes in B. As for the other honest nodes, h_1 knows m_{h,h_1} and m_{h_1,h} because it possesses them, but it cannot know m_{h,h_2} and m_{h_2,h}; the same holds for h_2. Therefore, in both cases, W̄_i is always secure against all the other nodes if there are at least three honest active nodes.
If the number of honest active nodes is less than three (e.g., two honest nodes h and h_1), S_h reduces to:

S_h' = λ_h·W̄_h + (m_{h,h_1} − m_{h_1,h}) mod p.

When S_h' is opened to h_1, h_1 can recover the value of W̄_h because h_1 also possesses m_{h,h_1} and m_{h_1,h}. Therefore, at least three honest nodes are necessary to guarantee the privacy of the local parameters.
Lastly, we analyze the efficiency of FCSA. Table 2 summarizes the computation and communication costs of the nodes and the FS for each main operation of FCSA. Node clustering is performed only once by the FS at the beginning of the federated learning. After mapping the nodes to a grid according to their GPS information, the FS assigns clusters sequentially starting from the nodes closest to the FS; the computational cost of the node clustering is O(N). During the aggregation, the main computational operations on the node side are mask generation, secure update generation, and reconstruction value generation. Since the aggregation is performed on a cluster basis, the computation costs of all these operations are proportional to the size of the cluster (the number of nodes belonging to the cluster). On the other hand, the main computational operations of the FS are the training weight computation and the aggregation of the local updates, and these operations are required for every node, so the computational cost of the FS is proportional to the number of nodes. Therefore, the total computation costs of a node and the FS are O(C) and O(N), respectively, where C represents the average cluster size. Next, in our model, all communications go through the FS. The most communicationally expensive operation on the node side is sending masks to the FS: for a cluster C, each node in C sends (|C| − 1) masks to the FS, so the communication cost of a node is proportional to the cluster size and is O(C). On the FS side, the most expensive operation is distributing the masks to all the nodes. The FS sends (|C| − 1) masks to each node in each cluster C, and since it must repeat this for all clusters, it eventually sends about (C − 1) masks to each of the N nodes. Thus, the communication cost of the FS is O(NC).

Experimental Results
In this section, we analyze the simulated performance of the proposed CSA. We evaluated the accuracy and overall processing time of our model using the MNIST database [28] under various federated learning situations. We compared the accuracy of our cluster-based learning model with a single centralized learning model according to quantization levels, training weights, and different dropout situations. In addition, we analyzed the overall processing time under various dropout situations.

Simulation Setup
We used the MNIST database for our experiments. The MNIST database contains 28 × 28 grayscale images of 10 digits and consists of a training set of 60,000 images and a test set of 10,000 images. For individual training runs, we used a two-layer CNN model with 5 × 5 convolution layers (the first with sixteen channels, the second with thirty-two channels, each followed by 2 × 2 max pooling), ReLU activation, and a final softmax output layer. We conducted our experiments in Python and implemented the learning architecture using the PyTorch framework.
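For reference, a PyTorch sketch of the described CNN. The padding choice (2, keeping the 28 → 14 → 7 feature-map sizes) and the class and layer names are our assumptions, as the paper does not specify them.

import torch
import torch.nn as nn

class MnistCnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # 28x28 -> 28x28
            nn.ReLU(), nn.MaxPool2d(2),                   # -> 14x14
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # -> 14x14
            nn.ReLU(), nn.MaxPool2d(2),                   # -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return torch.log_softmax(self.classifier(x), dim=1)  # softmax output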
The MNIST data were distributed to 100 nodes in a non-IID way. First, the nodes were evenly divided into four clusters (twenty-five nodes per cluster) with different PS levels. The PS level, denoted PSL, took a value from 1 to 4: PSL 1 represents the group with the shortest response time, and PSL 4 the group with the longest response time. Accordingly, the nodes in the PSL 1 cluster were allocated a small amount of training data, whereas the nodes in the PSL 4 cluster were assigned a relatively large amount. Specifically, we assigned 100 randomly chosen examples to each node with PSL 1, 400 examples to each node with PSL 2, 700 examples to each node with PSL 3, and 1000 examples to each node with PSL 4. In total, 55,000 data items were used for training in the experiment (see the sketch below).
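A sketch of this data assignment; it reproduces only the per-PSL shard sizes (the exact non-IID label skew is not specified in the paper), and the function name is hypothetical.

import random

SAMPLES_PER_PSL = {1: 100, 2: 400, 3: 700, 4: 1000}

def assign_data(num_train=60000, nodes_per_cluster=25, seed=0):
    rng = random.Random(seed)
    indices = list(range(num_train))
    rng.shuffle(indices)
    shards, cursor = {}, 0
    for psl, n_samples in SAMPLES_PER_PSL.items():
        for i in range(nodes_per_cluster):
            node_id = (psl - 1) * nodes_per_cluster + i
            shards[node_id] = indices[cursor:cursor + n_samples]
            cursor += n_samples
    return shards   # node_id -> list of MNIST training-set indices

shards = assign_data()
assert sum(len(s) for s in shards.values()) == 55000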
The client nodes were implemented as 100 threads on a PC configured with an RTX 3080 GPU, an Intel(R) i7-12700K 3.61 GHz CPU, 32 GB of memory, and Windows 11. The server PC was configured with an RTX 3080 GPU, an Intel(R) i9-11900F 2.50 GHz CPU, a 1 TB SSD, 64 GB of memory, and Windows 11. Since the nodes were created as threads on the same PC, the actual communication latency and computing power of the nodes were identical; only the actual local training times differed due to the different sizes of the training data. Therefore, the single-round latency per cluster was set to 20 s for the PSL 1 cluster, 25 s for the PSL 2 cluster, 30 s for the PSL 3 cluster, and 35 s for the PSL 4 cluster. This represents the maximum waiting time when dropout occurs; if no dropout occurred, the next round was started as soon as the responses from all nodes were collected. For the federated learning, the FS performed 100 iterations of local weight updates with the nodes. The main experimental parameters and values are summarized in Table 3 below.

Table 3. Main experimental parameters and values.
Total number of nodes / nodes per cluster: 100 / 25
Number of clusters (PS levels): 4
Training data per node (PSL 1 / 2 / 3 / 4): 100 / 400 / 700 / 1000 examples
Single-round latency per cluster (PSL 1 / 2 / 3 / 4): 20 s / 25 s / 30 s / 35 s
Number of federated learning rounds: 100

Simulated Performance
In the experiment, FCSA with four clusters (C = 4) was used as the cluster-based federated learning model, while FCSA with a single cluster (C = 1) was used as the single centralized learning model. Before analyzing the accuracy of FCSA, we note that when we simply performed the federated learning on 100 nodes without applying the proposed CSA, the accuracy was about 90.5%. Since FCSA requires quantization for mask sharing and validation, we first analyzed the accuracy of FCSA according to the quantization level. Figure 4 shows the accuracy of the proposed FCSA for different quantization levels (qLevel). The results showed the best accuracy when qLevel was 300 in both settings, and FCSA with C = 4 showed better accuracy than FCSA with C = 1. When no dropout occurred, the accuracy of FCSA with C = 4 was 91.08%, while that of FCSA with C = 1 was 87%. On the other hand, when qLevel was 100, the accuracy of FCSA with C = 4 was 88.3%, but the accuracy of FCSA with C = 1 was about 40%. From these results, we can see that the quantization level greatly affects the learning accuracy. However, an appropriate quantization level can prevent the loss of accuracy due to quantization, as shown by the fact that the accuracy of FCSA with C = 4 at a qLevel of 300 was better than that of simple federated learning without quantization. In the federated learning model, since the local data size of each device was small, relatively small quantization values were sufficient to produce accurate learning results. In contrast, the centralized learning model required large quantization values, as it worked with the full data collected from the local devices.

Table 4 shows the accuracy according to the dropout rate (dr). In this experiment, for FCSA with C = 1, a fraction of all the nodes equal to the dropout rate was randomly removed from the federated learning. In the case of FCSA with C = 4, the dropout nodes were selected per cluster; in other words, if dr = 30%, then 30% of the nodes in each cluster were randomly removed from the learning. It should be noted that the accuracy of FCSA with C = 1 decreased as the dropout rate increased, whereas the accuracy of FCSA with C = 4 was not significantly affected by the dropout rate: FCSA with C = 4 produced similar accuracies regardless of the dropout rate. This result is meaningful. In both cases, the actual amount of training data used for the federated learning was the same, but the accuracy was different. In the case of FCSA with C = 4, the data size used for local learning was different for each cluster. However, even when dr = 50%, the accuracy of the overall learning did not deteriorate, because the remaining nodes still produced good learning results in the C_3 and C_4 groups, which performed local learning with relatively large amounts of data.

Figure 5 shows the difference in accuracy between applying and not applying the training weights (λ) to FCSA with C = 4. When the training weights were not applied, the accuracy dropped slightly to 90.01%. Since FCSA with C = 4 was not affected by the per-cluster dropout rate, we also compared the accuracy when cluster-wide dropout occurred; Figure 6 shows the results. When all the nodes belonging to C_1 were dropped, the accuracy was around 91.8%. From this, it can be seen that the local learning results of the nodes belonging to the C_1 cluster had little effect on the overall learning, and the overall learning accuracy was even increased by the nodes belonging to the C_3 and C_4 clusters, which were based on sufficient data. On the other hand, when C_4 was dropped, the accuracy fell to 89.4%, and when the training weights were not applied and C_4 was dropped, the accuracy decreased to 87.11%. From these results, we can conclude that an accurate local model based on sufficient data is the most important factor for increasing the accuracy of the overall federated learning.

Lastly, we analyzed the total running time under different dropout situations. Figure 7 shows the overall running time (100 rounds) when cluster-wide dropout occurred. When C_1 was dropped, the overall time was about 1796 s, and when C_4 was dropped, it was about 4588 s. This is because the latency differed depending on the cluster level, and the total execution time increased proportionally. Figure 8 compares the total execution time of FCSA with C = 1 and FCSA with C = 4. When no dropout occurred, the running time of FCSA with C = 1 was about 1878 s and that of FCSA with C = 4 was about 1815 s; the difference was negligible. When dropout occurred, the running time of FCSA with C = 1 was about 4872 s, while that of FCSA with C = 4 was 4011 s. FCSA with C = 1 applied the longest latency to all dropout nodes, whereas FCSA with C = 4 applied a different latency to dropout nodes depending on the cluster, so the total execution time of FCSA with C = 4 was relatively reduced.

Conclusions and Future Work
The purpose of our study was to propose a practical secure aggregation model that protects the privacy of local updates, is robust against dropouts, reduces computation and communication costs, and shortens the overall running time in federated learning with heterogeneous devices. To achieve this, we proposed a new cluster-based secure aggregation (CSA) strategy that effectively handles dropouts and reduces overhead in situations where the size of the training data, the computing power, and the communication distance of each node differ.
To deal with dropout nodes, the server must first be able to detect that a dropout has occurred. In other words, the server must determine within a reasonable amount of time whether the non-response of some nodes is due to communication delay or a message drop. Simply increasing the waiting time for responses slows down the entire federated learning process, since the server has to interact with the nodes repeatedly.
The proposed CSA method clusters nodes with similar response times and performs cluster-by-cluster aggregation of the nodes' local updates. To achieve this, we suggested a grid-based clustering algorithm that clusters nodes according to their processing levels and locations. In this model, the server can estimate an appropriate latency from the response times of the nodes in each cluster and can treat unresponsive nodes as dropouts after the latency period has elapsed. The proposed CSA method therefore provides a reasonable trade-off that reduces the overall federated learning time while increasing the accuracy of dropout decisions.
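The per-cluster timeout idea can be sketched as follows. The transport helper recv_update and all names here are hypothetical; the sketch only shows how an estimated latency turns non-responses into dropout decisions:

import time

def collect_cluster_updates(expected_nodes, latency_s, recv_update):
    # Wait at most latency_s (the latency estimated from this cluster's
    # typical response time) and mark non-responding nodes as dropouts.
    # recv_update(timeout) is assumed to return (node_id, update) or None
    # on timeout; this transport-layer helper is hypothetical.
    updates, deadline = {}, time.monotonic() + latency_s
    while len(updates) < len(expected_nodes):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        msg = recv_update(timeout=remaining)
        if msg is not None:
            node_id, update = msg
            updates[node_id] = update
    dropouts = set(expected_nodes) - set(updates)
    # The dropouts' mask shares are then removed as described in Section 3.
    return updates, dropouts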
We also proposed a new additive sharing-based masking technique to securely aggregate the nodes' local updates. It is robust to dropout users and protects the privacy of the local model parameters as long as each cluster has at least three honest nodes (not colluding with the federated learning server). We theoretically proved the robustness and security of the proposed aggregation algorithm. Specifically, the proposed masking scheme allows nodes to publicly validate the correctness and integrity of the masks created by others. The masks randomly created on nodes must be removed during aggregation at the server for the aggregate to be correct, so each node must be able to proactively check that the given masks can be eliminated later, before it generates a secure update with them. In the proposed scheme, each node publishes public masks along with the random masks that are securely delivered to other nodes, so the integrity of the delivered masks can be verified against the public masks based on a discrete logarithm problem. In addition, the masks of each node are created from a random nonce (the cluster nonce) assigned to each cluster. Thus, each node can easily and publicly validate the correctness of the masks created by others by testing that the product of the public masks equals the public cluster nonce, without learning the actual masks delivered to others. The proposed masking technique thereby solves the mask verification problem needed for reliable aggregation using the discrete logarithm problem. Quantization of the local parameters is required to use the discrete logarithm in the protocol, and the quantization level strongly affects the learning accuracy. However, our simulation results showed that the loss of accuracy due to quantization can be prevented by choosing an appropriate quantization level; in our experiments, there was no loss of accuracy when using a quantization level of 300 for FCSA with four clusters.
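To make the two verification checks concrete, here is a minimal Python sketch of additive masks derived from a cluster nonce, with public masks g^m and the product test against the public nonce. The toy group parameters and helper names are ours; a real deployment would use a standardized prime-order group and the paper's full mask-delivery protocol:

import secrets

# Toy group parameters for illustration only; NOT production-safe.
p = 2**127 - 1          # a Mersenne prime modulus
g = 3                   # toy generator
q = p - 1               # exponents reduced mod p-1 (Fermat's little theorem)

def make_masks(cluster_nonce, n_members):
    # Split the secret cluster nonce additively: m_1 + ... + m_n = nonce (mod q).
    masks = [secrets.randbelow(q) for _ in range(n_members - 1)]
    masks.append((cluster_nonce - sum(masks)) % q)
    return masks

def public_masks(masks):
    # Commitments g^{m_j} mod p that anyone can check.
    return [pow(g, m, p) for m in masks]

def verify_masks(pub_masks, public_nonce):
    # Public correctness check: the product of the public masks must equal
    # g^{nonce}, i.e. the masks really add up to the cluster nonce.
    prod = 1
    for y in pub_masks:
        prod = (prod * y) % p
    return prod == public_nonce

def verify_received(mask, pub_mask):
    # Integrity check run by the recipient of one private mask.
    return pow(g, mask, p) == pub_mask

r = secrets.randbelow(q)              # cluster nonce (secret)
R = pow(g, r, p)                      # public cluster nonce
masks = make_masks(r, n_members=4)
pubs = public_masks(masks)
assert verify_masks(pubs, R)
assert all(verify_received(m, y) for m, y in zip(masks, pubs))

Each recipient runs verify_received on its own private mask, while any party can run verify_masks from the published values alone, without learning the masks delivered to others.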
The proposed CSA method also reduces the computational and communication overhead by letting nodes share aggregation-related data only with nodes in the same cluster. Supposing that C is the average cluster size and N is the number of nodes, each node's computation and communication costs are both O(C), while the server's computation cost is O(N) and its communication cost is O(NC). We evaluated the performance of the proposed model on the MNIST dataset. In a simulation with one hundred nodes, FCSA with four clusters achieved about 91% learning accuracy regardless of the dropout rate, whereas centralized FCSA (with one cluster) achieved about 87% accuracy without dropout and dropped to about 79% at a dropout ratio of 50%. In a centralized federated learning model, the accuracy is highly dependent on the dropout rate; in our model, a dropout rate of 50% hardly changed the accuracy, because the remaining nodes in the high-PS-level clusters still produced good learning results from relatively large amounts of data. These results show that the proposed CSA method is well suited to federated learning over heterogeneous devices with different training data sizes.
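As a back-of-the-envelope instance of these bounds (our numbers, assuming four equal clusters):

# Message counts for one aggregation round, assuming N = 100 nodes split
# evenly into 4 clusters (C = 25). Illustrative only.
N, K = 100, 4
C = N // K                    # average cluster size
node_msgs = C - 1             # each node shares masks within its cluster: O(C)
server_recv = N               # one masked update per node: O(N)
server_comm = N * C           # per-cluster share handling at the server: O(NC)
print(node_msgs, server_recv, server_comm)   # -> 24 100 2500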
Our model basically assumes that all participants, including the federated learning server and the nodes, are honest, so it does not separately verify each node's local update or the server's aggregation result. However, since the server and nodes cannot be fully trusted in the real world, the proposed CSA method is vulnerable to Byzantine users. It also has the drawback that a bottleneck can occur at the server. We will therefore conduct further research on verifiable cluster-based aggregation that can validate both user results and aggregated results. With verifiable aggregation, the proposed CSA method can be extended to a cross-device federated learning model in which reliable, powerful nodes act as cluster heads that perform intermediate aggregation on behalf of the server, which would also resolve the server bottleneck.
We ultimately aim to develop a federated learning platform for medical diagnosis. Since medical data is among the most sensitive personal data, hospitals are extremely reluctant to expose it, so federated learning with secure aggregation is well suited to building a medical diagnosis model from medical data scattered across hospitals. In particular, each hospital deals with different diseases, hospital sizes and the corresponding patient data differ, and the diagnosis results differ as well. Therefore, it is necessary to cluster hospitals according to the type of disease, the size of the hospital, and the size of the patient data.

CC = floating-point operations/second, PS = training data size (MB)/CC. (5)

Whenever a new federated learning task starts, every node calculates its PS and sends the PS, training data size, and GPS information to the FS (a toy sketch of this computation follows the list below). The key strategies for node clustering are:
- Each cluster must contain at least four nodes (the minimum cluster size, denoted δ, is four).
- PS is divided into K levels (K is systemically predefined).
- The entire area where the nodes are distributed is divided into an a × b grid (a and b are systemically predefined), and the nodes are mapped to the grid by their GPS information.
- Node clustering starts with the nodes closest to the FS, according to PS level, and repeats sequentially for the next neighbor nodes around the FS until all nodes are clustered.
- Clusters are finally rebalanced so that each cluster satisfies δ.
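A toy sketch of this per-node measurement, with a crude FLOPS probe and hypothetical PS thresholds standing in for the system-defined K levels:

import time

def measure_cc_flops(n=200_000):
    # Crude probe for the node's computing capacity CC. A real client would
    # use a proper benchmark; this loop is only illustrative.
    t0 = time.perf_counter()
    x = 1.0
    for _ in range(n):
        x = x * 1.000001 + 0.5   # two floating-point ops per iteration
    dt = time.perf_counter() - t0
    return (2 * n) / dt

def ps_level(train_mb, cc, thresholds):
    # PS = training data size (MB) / CC (Equation (5)), bucketed into K levels.
    # thresholds: the K-1 ascending PS cut points (hypothetical values below).
    ps = train_mb / cc
    level = 1
    for t in thresholds:
        if ps > t:
            level += 1
    return ps, level

cc = measure_cc_flops()
ps, lvl = ps_level(train_mb=120.0, cc=cc, thresholds=[1e-9, 1e-8, 1e-7])
# The node would then send (ps, train_mb, gps) to the FS for clustering.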
…and G(rf+1, cf+1). In this way, the next neighbor is obtained by increasing the index by 1, and the node clustering is repeated for all nodes in the next neighbor. The blue dotted line in Figure 1 represents the neighbor in each round. As shown in the figure, u1, u2, and u3 are clustered in the first round according to their PS level, and u4 is clustered in the second round. The rest of the nodes are clustered in the same way. Figure 1 shows an example of node distribution over a 6 × 5 grid; the number in each circle indicates the PS level of the node.

Figure 1. An example of node distribution over a grid.
Algorithm 1: Node Clustering
Input: G: a grid of nodes; mr: the maximum row index; mc: the maximum column index; rf: the row index of the server cell; cf: the column index of the server cell; U = {u1, …, uN} with each ui's GPS information and PS level.
Output: K clusters C1, …, CK
create K empty clusters C1, …, CK, where each Ci is hierarchically structured with an initial bucket B = ∅;
for each node u in U
    generate Ui,j using u's GPS information and G;
for (D = 0; (rf + D <= mr || cf + D <= mc); D++)
    for each cluster Ci
        if (the leaf bucket of Ci ≠ ∅)
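Since the listing above breaks off mid-loop, the following Python sketch gives one plausible reading of the grid-clustering idea: expand the neighborhood ring by ring around the server cell and group nodes by PS level. It is our approximation, not the authors' exact Algorithm 1:

def cluster_nodes(nodes, mr, mc, rf, cf, K):
    # nodes: list of dicts {"id", "row", "col", "ps_level"}; ps_level in 1..K.
    # (rf, cf): grid cell of the federated server. Returns K lists of node ids.
    clusters = [[] for _ in range(K)]
    for d in range(0, max(mr, mc) + 1):            # expand the neighborhood ring
        for u in nodes:
            dist = max(abs(u["row"] - rf), abs(u["col"] - cf))
            if dist == d:                          # node enters at ring d
                clusters[u["ps_level"] - 1].append(u["id"])
    return clusters   # rebalancing to the minimum size delta happens afterwards

nodes = [
    {"id": "u1", "row": 2, "col": 2, "ps_level": 1},
    {"id": "u2", "row": 2, "col": 3, "ps_level": 2},
    {"id": "u3", "row": 4, "col": 1, "ps_level": 1},
]
print(cluster_nodes(nodes, mr=5, mc=4, rf=2, cf=2, K=3))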


Figure 2. Results after clustering.

Once the clustering is complete, the FS forms the final clusters by merging the clusters whose size is smaller than δ. The merge begins from cluster level K and repeats the following process: if CK does not satisfy δ and CK−1 is sufficiently big, then some nodes of CK−1 are merged into CK; otherwise, CK is merged into CK−1. The detailed algorithm is given below (a runnable sketch follows the listing).

Algorithm 2: Merge Clusters
while K > 1
    if |CK| < δ
        l = δ − |CK|;
        if (|CK−1| − l) >= δ
            l nodes at the highest level in CK−1's node structure are merged into CK;
        else
            CK−1 = CK−1 ∪ CK; remove CK;
    K = K − 1;
while |C1| < δ
    Cn = the next-order cluster of C1;
    C1 = Cn ∪ C1; remove Cn;
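A runnable rendering of Algorithm 2 under one assumption of ours: each cluster's node list is ordered so that its "highest level" nodes come last. The example input reproduces the Figure 3 walk-through, with u8 as a placeholder for C3's unnamed initial member:

def merge_clusters(clusters, delta):
    # clusters: dict mapping level k (1..K) to a list of node ids, ordered so
    # that the highest-level nodes come last (our assumption).
    K = max(clusters)
    while K > 1:
        if len(clusters[K]) < delta:
            need = delta - len(clusters[K])
            if len(clusters[K - 1]) - need >= delta:
                # move `need` highest-level nodes from C_{K-1} into C_K
                moved = clusters[K - 1][-need:]
                clusters[K - 1] = clusters[K - 1][:-need]
                clusters[K].extend(moved)
            else:
                clusters[K - 1].extend(clusters[K])   # C_{K-1} = C_{K-1} U C_K
                del clusters[K]
        K -= 1
    while len(clusters.get(1, [])) < delta and len(clusters) > 1:
        n = min(k for k in clusters if k != 1)        # next-order cluster of C_1
        clusters[1] = clusters.pop(n) + clusters[1]   # C_1 = C_n U C_1
    return clusters

print(merge_clusters({1: ["u1", "u3", "u5", "u10", "u7"],
                      2: ["u2", "u4", "u6", "u12", "u13"],
                      3: ["u8"], 4: ["u9"], 5: ["u11"]}, delta=4))

On this input the function returns C1 = <u1, u3, u5, u10, u7>, C2 = <u2, u4, u6, u12>, and C3 = <u8, u9, u11, u13>, matching the walk-through below.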
Figure 3 shows the final result after cluster merging. First, |C5| is 1 and |C4| is 1, so u11 is merged into C4 and C5 is discarded. Next, |C4| is still 2, which is less than 4, and |C3| is 1, so u9 and u11 in C4 are merged into C3, and C4 is discarded as well. Now |C3| is 3, one node short of the requirement. |C2| is big enough to give up one node, so u13, the node with the longest distance to the FS, is moved into C3. Since both C2 and C1 now meet the requirement, the merge process terminates. Consequently, C1, C2, and C3 are finally created: C1 contains nodes <u1, u3, u5, u10, u7>, C2 contains nodes <u2, u4, u6, u12>, and C3 contains the remaining nodes, including u9, u11, and u13.

(3) Efficiency of secure aggregation: Let C be the average cluster size. The computation cost of a node is O(C), and the FS's computation cost is O(N). The communication cost of a node is O(C), and the FS's communication cost is O(NC).

Figure 4. Accuracy according to the quantization levels.


Figure 5. Accuracy by dropout rate and training weights (λ).
Figure 6. Accuracy under cluster-based dropout.


Figure 7. Total running time by cluster-based dropout.

Figure 8. Total running time by dropout rate per cluster.

Table 2. Computational and communication overheads of FCSA.

Table 3. Experimental parameters and values.

Table 4. Accuracy according to different dropout rates.
