Compare Where It Matters: Using Layer-Wise Regularization To Improve Federated Learning on Heterogeneous Data

Federated Learning is a widely adopted method for training neural networks over distributed data. One main limitation is the performance degradation that occurs when data is heterogeneously distributed. While many works have attempted to address this problem, these methods under-perform because they are founded on a limited understanding of neural networks. In this work, we verify that only certain important layers in a neural network require regularization for effective training. We additionally verify that Centered Kernel Alignment (CKA) most accurately calculates the similarity between layers of neural networks trained on different data. By applying CKA-based regularization to important layers during training, we significantly improve performance in heterogeneous settings. We present FedCKA: a simple framework that outperforms previous state-of-the-art methods on various deep learning tasks while also improving efficiency and scalability.


Introduction
The success of deep learning in a plethora of fields has led to a vast amount of research that leverages its strengths (LeCun, Bengio, and Hinton 2015). One main outcome of this success is the mass collection of data (Sejnowski 2018). As the collection of data increases at a rate much faster than the computing performance and storage capacity of consumer products, it is becoming progressively difficult to deploy trained state-of-the-art models within a reasonable budget.
Federated Learning (FL) (McMahan et al. 2017) has been introduced as a method to train a neural network with massively distributed data. The most widely used and accepted approach for the training and aggregation process is FedAvg (McMahan et al. 2017). FedAvg is appealing for many reasons, such as negating the cost of collecting data into a centralized location, and effective parallelization across computing units (Verbraeken et al. 2019). Thus, it has been applied to a wide range of research areas, including distributed learning frameworks for vehicular networks (Samarakoon et al. 2020) and IoT devices (Yang et al. 2020), and even as a privacy-preserving method for medical records (Brisimi et al. 2018).
One major issue with the application of FL is the performance degradation that occurs with heterogeneous data. This refers to settings in which data is not independent and identically distributed (non-IID) across clients. The drop in performance is seen to be caused by a disagreement in local optima. That is, because each client trains its copy of the neural network on its individual local data, the resulting average can stray from the true optimum. Unfortunately, it is realistic to expect non-IID data in many real-world applications (Kairouz et al. 2021; Hsu, Qi, and Brown 2019a). In light of this, many works have attempted to address this problem by regularizing the entire model during the training process (Li et al. 2020; Karimireddy et al. 2020; Li, He, and Song 2021). However, we argue that these works are based on a limited understanding of neural networks.
In this work, we present FedCKA to address these limitations. First, we show that regularizing the first two naturally similar layers matters most for improving performance in non-IID settings. Previous works regularized every layer individually. Not only is this ineffective for training, it also limits scalability as the number of layers in a model increases. By regularizing only these important layers, performance improves beyond previous works. Efficiency and scalability are also improved, as we do not need to calculate regularization terms for every layer. Second, we show that Centered Kernel Alignment (CKA) is most accurate when comparing the representational similarity between layers of neural networks. Previous works added a regularization term that compares the representations of neural networks with simple inner products such as the l2-distance (FedProx) or cosine similarity (MOON). By using CKA to more accurately compare and regularize local updates, we improve performance; hence the name FedCKA. Our contributions are summarized as follows:
• We improve performance in heterogeneous settings. By building on the most up-to-date understanding of neural networks, we apply layer-wise regularization to only important layers.
• We improve the efficiency and scalability of regularization. By regularizing only important layers, FedCKA is the only regularization method with training times comparable to FedAvg.

Layers in Neural Networks
Understanding the function of layers in a neural network is an under-researched area of deep learning. It is, however, an important prerequisite for the application of layer-wise regularization. We build our work on the findings of two relevant papers. The first (Zhang, Bengio, and Singer 2019) showed that there are certain 'critical' layers that define a model's performance. In particular, when layers were re-initialized back to their original weights, 'critical' layers heavily decreased performance, while 'robust' layers had minimal impact. This work drew several relevant conclusions. First, the very first layer of a neural network is most sensitive to re-initialization. Second, robustness is not correlated with the l2-norm or l∞-norm between initial and trained weights. Considering these conclusions, we understand that certain layers are not important in defining performance. Regularizing these non-important layers would be ineffective, and may even hurt performance.
The second work (Kornblith et al. 2019) introduced Centered Kernel Alignment (CKA) as a metric for measuring the similarity between layers of neural networks. In particular, the work showed that metrics that calculate the similarity between representations of neural networks should be invariant to orthogonal transformations and isotropic scaling, but not to invertible linear transformations. This work drew one very relevant conclusion: for neural networks trained on different datasets, early layers, but not late layers, learn similar representations. Considering this conclusion, to properly regularize neural networks trained on different datasets, we should focus on layers that are naturally similar, and not on those that are naturally different.

Federated Learning with Non-IID Data
Federated Learning typically progresses with the repetition of four steps, as shown in Figure 1. 1) A centralized or decentralized server broadcasts a model (the global model) to each of its clients. 2) Each client trains its copy of the model (the local model) with its local data. 3) Each client uploads its trained model to the server. 4) The server aggregates the trained models into a single model and prepares it to be broadcast in the next round. These steps are repeated until convergence or other criteria are met.
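The four steps above can be sketched end to end. Below is a minimal NumPy sketch in which a single linear model trained by gradient descent on a least-squares loss stands in for each client's neural network; the function names and the toy task are ours, for illustration only.

```python
import numpy as np

def local_train(global_weights, data, lr=0.1, epochs=1):
    """Step 2: one client's local update. A linear least-squares model
    stands in for an arbitrary neural network."""
    w = global_weights.copy()
    X, y = data
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

def fedavg_round(global_weights, client_data):
    """One communication round: broadcast (1), local training (2),
    upload (3), and sample-size-weighted averaging (4)."""
    local_models, sizes = [], []
    for data in client_data:                                    # step 1
        local_models.append(local_train(global_weights, data))  # steps 2-3
        sizes.append(len(data[1]))
    total = float(sum(sizes))
    # step 4: weight each client's model by its share of the data
    return sum(w * (n / total) for w, n in zip(local_models, sizes))
```

Repeating `fedavg_round` until convergence completes the loop; a real deployment replaces `local_train` with SGD on the client's actual network.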
Works that improve performance on non-IID data generally fall into two categories. The first focuses on regularizing or modifying the client training process (step 2). The second focuses on modifying the aggregation process (step 4). Here, we focus on the former, as it is more closely related to our work. Namely, we focus on FedProx (Li et al. 2020), SCAFFOLD (Karimireddy et al. 2020), and MOON (Li, He, and Song 2021), all of which add a regularization term to the default FedAvg (McMahan et al. 2017) training process.
FedAvg was the first work to introduce Federated Learning. Each client trains a model with gradient descent on a supervised loss, and the server averages the trained models weighted by the number of data samples each client holds. However, due to the performance degradation in non-IID settings, many works have added a regularization term to the default FedAvg training process. The objective of these methods is to decrease the disagreement in local optima by limiting local updates that stray too far from the global model. FedProx adds a proximal regularization term that calculates the l2-distance between the local and global models. SCAFFOLD adds a control-variate regularization term that induces variance reduction on local updates based on the updates of other clients. Most recent, and most similar to our work, is MOON. MOON adds a contrastive regularization term that calculates the cosine similarity between the MLP projections of the local and global models. The work takes inspiration from contrastive learning, in particular SimCLR (Chen et al. 2020). The intuition is that the global model is less biased than local models, so local updates should be more similar to the global model than to past local models. One difference to note is that while contrastive learning trains a model using the projections of one model on many different images (i.e., one model, different data), MOON regularizes a model using the projections of different models on the same images (i.e., three models, same data).
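FedProx's proximal term from the paragraph above can be written down directly. A minimal sketch, assuming the local and global models are given as matching lists of NumPy weight arrays (the function name is ours):

```python
import numpy as np

def fedprox_penalty(local_weights, global_weights, mu):
    """FedProx's proximal term: (mu / 2) * squared l2-distance between
    the flattened local and global parameters. Added to the client's
    task loss, it penalizes local updates that stray from the global
    model; its gradient w.r.t. each local tensor is mu * (local - global)."""
    diff = np.concatenate([(wl - wg).ravel()
                           for wl, wg in zip(local_weights, global_weights)])
    return 0.5 * mu * float(diff @ diff)
```

Note that the penalty ranges over every parameter tensor of the model, which is exactly the whole-model regularization we argue against in the next paragraph.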
Overall, these works add a regularization term by comparing all layers of the neural network. However, we argue that only important layers should be regularized. Late layers are naturally dissimilar when trained on different datasets. Regularizing a model based on these naturally dissimilar late layers would be ineffective. Rather, it may be beneficial to focus only on the earlier layers of the model.

Regularizing Naturally Similar Layers
FedCKA is designed on the principle that naturally similar, but not naturally dissimilar, layers should be regularized. This is based on the premise that early layers, but not late layers, develop similar representations when trained on different datasets (Kornblith et al. 2019). We verify this in a Federated Learning environment. Using a small convolutional neural network, we trained 10 clients for 20 communication rounds on independently and identically distributed (IID) subsets of the CIFAR-10 (Krizhevsky 2009) dataset. After training, we measured the similarity between each layer of the local and global models with Centered Kernel Alignment (Kornblith et al. 2019) on the CIFAR-10 test set. The similarity of each layer between the local and global models is shown in Figure 2. We verify that early layers, but not late layers, develop similar representations even in the most favorable Federated Learning setting, where the data distribution across clients is IID.
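Linear CKA itself is cheap to compute from two representation matrices evaluated on the same batch of examples. A minimal NumPy sketch using the inner-product form (equivalent to the eigenvalue form of Kornblith et al. 2019); the function name is ours:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices with one row per
    example (the same examples for both) and one column per neuron.
    Returns 1.0 for identical representations, and is invariant to
    orthogonal transformations and isotropic scaling."""
    X = X - X.mean(axis=0)  # center every feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro')
                   * np.linalg.norm(Y.T @ Y, 'fro'))
```

The two invariances are exactly the properties Kornblith et al. (2019) argue a representation-similarity metric should have.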
The objective of regularizing local updates is to penalize updates that stray from the global model. However, late layers are naturally dissimilar even in optimal Federated Learning settings; regularizing them would penalize updates that may have been beneficial to training. Thus, FedCKA regularizes only the first two naturally similar layers. For convolutional neural networks without residual blocks, these are the two layers closest to the input. For ResNets (He et al. 2016), they are the initial convolutional layer and the first post-residual block; as noted in Kornblith et al. (2019), post-residual layers, but not layers within residual blocks, develop similar representations. This is unique to previous works, which regularized local updates based on all layers. It also makes FedCKA much more scalable than other methods: the computational overhead of previous works increases rapidly in proportion to the number of parameters, because all layers are regularized, whereas FedCKA keeps the overhead nearly constant by regularizing only two layers close to the input.

Measuring Layer-wise Similarity
FedCKA is designed to regularize dissimilar updates in layers that should naturally be similar. However, there is currently no standard for measuring the similarity of layers between neural networks. While there are classical methods of univariate or multivariate analysis for comparing matrices, these are not suitable for comparing the layers and representations of different neural networks (Kornblith et al. 2019). As for norms, Zhang, Bengio, and Singer (2019) concluded that a layer's robustness to re-initialization is not correlated with the l2-norm or l∞-norm. This suggests that using these norms to regularize dissimilar updates, as in previous works, may be inaccurate. Kornblith et al. (2019) concluded that similarity metrics for comparing the representations of different neural networks should be invariant to orthogonal transformations and isotropic scaling, but not to invertible linear transformations. The work introduced Centered Kernel Alignment (CKA), and showed that the metric is most consistent in measuring the similarity between representations of neural networks. Thus, FedCKA regularizes local updates using CKA as its similarity measure.

Modifications to FedAvg
FedCKA adds a regularization term to the local training process of the default FedAvg algorithm, keeping the entire framework simple. Alg 1 and Fig 3 show the FedCKA framework in algorithm and figure form, respectively. More formally, we add $\ell_{cka}$ as a regularization term to the FedAvg training objective. The local loss function is shown in Eq 1.
$$\ell = \ell_{sup}(w^t_{l_i}; D_i) + \mu\,\ell_{cka}(w^t_{l_i};\, w^t_g;\, w^{t-1}_{l_i};\, D_i) \quad (1)$$
Here, $\ell_{sup}$ is the cross-entropy loss, $w^t_{l_i}$ is client $i$'s local model at round $t$, $w^t_g$ is the global model, $D_i$ is client $i$'s local dataset, and $\mu$ is a hyper-parameter that controls the strength of the regularization term $\ell_{cka}$ relative to $\ell_{sup}$. $\ell_{cka}$ is shown in more detail in Eq 2.
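A minimal NumPy sketch of the regularization term in Eq 1, assuming the MOON-style contrastive form described in the next paragraph (linear CKA as the similarity, no temperature, averaged over the M regularized layers); the function names are ours:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (batch, features) representation matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, 'fro') ** 2
            / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')))

def cka_regularizer(local_reps, global_reps, prev_reps):
    """Contrastive CKA term: for each of the M regularized layers, pull
    the current local representation toward the global model's and away
    from the previous round's local model's, then average over layers.
    Each argument is a list of M (batch, features) matrices computed on
    the same batch of inputs."""
    losses = []
    for a_l, a_g, a_prev in zip(local_reps, global_reps, prev_reps):
        pos = np.exp(linear_cka(a_l, a_g))     # similarity to global model
        neg = np.exp(linear_cka(a_l, a_prev))  # similarity to stale local model
        losses.append(-np.log(pos / (pos + neg)))
    return float(np.mean(losses))
```

The client's total loss is then the cross-entropy plus `mu * cka_regularizer(...)`, matching Eq 1.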
The formula of $\ell_{cka}$ is a slight modification of the contrastive loss used in SimCLR (Chen et al. 2020). There are four main differences. First, SimCLR uses the representations of one model on different samples in a batch to calculate the contrastive loss, whereas FedCKA uses the representations of three models on the same samples in a batch to calculate $\ell_{cka}$: $a^t_{l_i}$, $a^{t-1}_{l_i}$, and $a^t_g$ are the representations of client $i$'s current local model, client $i$'s previous-round local model, and the current global model, respectively. Second, SimCLR uses the temperature parameter $\tau$ to increase performance on difficult samples; FedCKA excludes $\tau$, as it was not seen to help performance. Third, SimCLR uses cosine similarity to measure the similarity between representations; FedCKA uses CKA. Fourth, SimCLR calculates the contrastive loss once per batch, using the representations of the projection head, while FedCKA calculates $\ell_{cka}$ $M$ times per batch, using the representations of the first $M$ naturally similar layers, indexed by $n$, and averages the loss over the number of regularized layers. $M$ is set to two by default unless otherwise stated.
$$\ell_{cka} = -\frac{1}{M}\sum_{n=1}^{M} \log \frac{\exp\big(\mathrm{CKA}(a^t_{l_i,n},\, a^t_{g,n})\big)}{\exp\big(\mathrm{CKA}(a^t_{l_i,n},\, a^t_{g,n})\big) + \exp\big(\mathrm{CKA}(a^t_{l_i,n},\, a^{t-1}_{l_i,n})\big)} \quad (2)$$
As per Kornblith et al. (2019), linear CKA is shown in Eq 3. Here, the $i$-th eigenvalue and eigenvector of $XX^T$ are $\lambda^X_i$ and $u^X_i$:
$$\mathrm{CKA}(XX^T, YY^T) = \frac{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2} \lambda^X_i \lambda^Y_j \langle u^X_i, u^Y_j \rangle^2}{\sqrt{\sum_{i=1}^{p_1} (\lambda^X_i)^2}\,\sqrt{\sum_{j=1}^{p_2} (\lambda^Y_j)^2}} \quad (3)$$

Experimental Setup
For CIFAR-10, we use a small convolutional neural network. The encoder consists of two convolutional layers with 16 and 32 channels respectively, and a 2x2 max-pooling layer following each convolutional layer. A projection head of four fully connected layers follows the encoder, with 120, 84, 84, and 256 neurons. The final layer is the output layer with the number of classes. Although FedCKA and other works can perform without this projection head, we include it because MOON shows a high discrepancy in performance without it. For CIFAR-100 and Tiny ImageNet, we use ResNet-50 (He et al. 2016), again adding the projection head before the output layer, as per MOON. We use the cross-entropy loss, and SGD as our optimizer with a learning rate of 0.1, momentum of 0.9, and weight decay of 0.00001.
Local epochs are set to 10. These are also the parameters used in MOON. Some small changes we made were to the batch size and communication rounds. We use a constant batch size of 128, and train for 100 communication rounds on CIFAR-10, 40 communication rounds on CIFAR-100, and 20 communication rounds on Tiny ImageNet. We use a lower number of communication rounds for the latter two datasets because the ResNet-50 model overfit quite quickly. As with many previous works, we use the Dirichlet distribution to simulate heterogeneous settings (Hsu, Qi, and Brown 2019b; Lin et al. 2021; Li, He, and Song 2021). The α parameter controls the strength of heterogeneity, with α → 0 being most heterogeneous and α → ∞ being non-heterogeneous. We report results for α ∈ [5.0, 0.5, 0.1], similar to MOON. Figure 4 shows the distribution of data across clients on the CIFAR-10 dataset with the different α. All experiments were conducted using the PyTorch (Paszke et al. 2019) library on a single GTX Titan V and four Intel Xeon Gold 5115 processors.
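The Dirichlet partitioning described above can be sketched as follows. This is our own minimal NumPy version (one Dirichlet draw per class decides how that class's samples are shared across clients), not the exact code used in the experiments.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split example indices across clients with per-class proportions
    drawn from Dir(alpha): small alpha yields highly heterogeneous
    shards, large alpha approaches an IID split."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # one Dirichlet draw per class decides how its samples are shared
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in zip(clients, np.split(idx, cuts)):
            client.extend(shard.tolist())
    return clients
```

Plotting the per-client label counts for α ∈ [5.0, 0.5, 0.1] reproduces the qualitative picture of Figure 4.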

Accuracy
FedCKA adds a hyperparameter µ to control the strength of $\ell_{cka}$. We tune µ over [3, 5, 10] and report the best results. MOON and FedProx also have a µ term, which we tune as well: for MOON over [0.1, 1, 5, 10] and for FedProx over [0.001, 0.01, 0.1, 1], as used in each work. In addition, for MOON, we use τ = 0.5 as reported in their work. Table 1 shows the performance across CIFAR-10, CIFAR-100, and Tiny ImageNet with α = 5.0. For FedProx, MOON, and FedCKA, we report performance with the best µ. For FedCKA, the best µ is 3, 10, and 3 for CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively. For MOON, the best µ is 10, 5, and 0.1. For FedProx, the best µ is 0.001, 0.1, and 0.1. Table 2 shows the performance across increasing heterogeneity on the CIFAR-10 dataset with α ∈ [5.0, 0.5, 0.1]. For FedCKA, the best µ is 5, 3, and 3 for α = 5.0, 0.5, and 0.1, respectively. For MOON, the best µ is 0.1, 10, and 10. For FedProx, the best µ is 0.001, 0.1, and 0.001. We observe that FedCKA consistently outperforms previous methods across different datasets and different α. FedCKA improves performance in heterogeneous settings owing to regularizing layers that are naturally similar, and not layers that are naturally dissimilar. It is also interesting that FedCKA performs better by a larger margin when α is larger. This is likely because the global model is less biased as the data distribution approaches the IID setting, and thus can more effectively regularize updates. We also observe that the other works consistently improve performance over FedAvg, albeit by a smaller margin than FedCKA. FedProx and SCAFFOLD improve performance likely owing to their inclusion of naturally similar layers in regularization; the gain is smaller because they also include naturally dissimilar layers.
MOON improves performance compared to FedProx and SCAFFOLD, likely owing to its use of a contrastive loss. That is, MOON shows that local models should be trained to be more similar to the global model than to their past local models, rather than only blindly similar to the global model. By regularizing naturally similar layers using a contrastive loss based on CKA, FedCKA outperforms all methods.
Note that across most methods and settings, there are discrepancies with the accuracy reported by MOON (Li, He, and Song 2021).

Regularizing Only Important Layers
We study the effects of regularizing different numbers of layers. Using the CIFAR-10 dataset with α = 5.0, we change the number of layers regularized through $\ell_{cka}$. Formally, we vary M in Eq 2 over M ∈ [1, 2, 3, 4, 5, 6, 7], and report the accuracy in Figure 5. Accuracy is highest when only the first two layers are regularized. This verifies our claim that naturally similar, but not naturally dissimilar, layers should be regularized (Figure 2). In addition, note the dotted line representing the upper bound for Federated Learning: when the same model is trained on a centralized server with the whole CIFAR-10 dataset, accuracy is 70%. FedCKA with regularization on the first two naturally similar layers nearly reaches this upper bound.

Using the Best Similarity Metric
We study the effects of regularizing the first two naturally similar layers with different similarity metrics. Using the CIFAR-10 dataset with α = 5.0, we change the similarity metric used in $\ell_{cka}$. Formally, we substitute the CKA term in Eq 2 with three alternatives. First, kernel CKA with an RBF kernel. Second, the squared Frobenius norm ($\|a_1 - a_2\|^2_F$). Third, the vectorized cosine similarity ($\mathrm{vec}(a_1) \cdot \mathrm{vec}(a_2) = \|\mathrm{vec}(a_1)\|\,\|\mathrm{vec}(a_2)\| \cos\theta$). We compare the results with these different metrics as well as the baseline, FedAvg. The results are shown in Table 3.
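For reference, the two non-CKA metrics can be sketched in a few lines of NumPy (the function names are ours, and `a1`, `a2` are (batch, features) representation matrices):

```python
import numpy as np

def squared_frobenius(a1, a2):
    """||a1 - a2||_F^2: a distance, so lower means more similar."""
    return np.linalg.norm(a1 - a2, 'fro') ** 2

def vectorized_cosine(a1, a2):
    """Cosine similarity between the flattened representations."""
    v1, v2 = a1.ravel(), a2.ravel()
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

Unlike CKA, neither metric is invariant to orthogonal transformations of the representation, which is one plausible reason for the gap observed in Table 3.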
We observe that performance is highest when CKA is used, likely owing to the accuracy with which it measures similarity: only truly dissimilar updates are penalized, thus improving performance. In addition, while kernel CKA slightly outperforms linear CKA, considering the computational overhead, we opt to use linear CKA. We also observe that the squared Frobenius norm and vectorized cosine similarity decrease performance only slightly, and these variants still outperform most previous works. This verifies that while it is important to use an accurate similarity measure, it is more important to focus on regularizing naturally similar layers.

Efficiency and Scalability
Efficient and scalable local training is an important engineering principle of Federated Learning. That is, for Federated Learning to be applied to real-world applications, we must assume that clients have limited computing resources. Thus, we analyze the local training time of all methods, as shown in Table 4. Note that FedAvg is the lower bound for training time, since all other methods add a regularization term.
For a 7-layer CNN trained on CIFAR-10, the training times of all methods are fairly similar. FedCKA extends training by the largest amount, as the matrix multiplications needed to calculate CKA are expensive in proportion to the forward and backward propagation of the small model. However, for ResNet-50 trained on Tiny ImageNet, the training times of FedProx, SCAFFOLD, and MOON increase dramatically. Only FedCKA has training times comparable to FedAvg. This is because FedProx and SCAFFOLD perform expensive operations on the weights of each layer, and MOON performs forward propagation on three models up to the penultimate layer; all of these operations grow as the number of layers increases. While FedCKA also performs forward propagation on three models, it does so only up to the two regularized layers, so its overhead remains nearly constant even for larger models.
We emphasize that regularization must remain scalable for Federated Learning to be applied to state-of-the-art models. Even on ResNet-50, which is no longer considered a large model, other Federated Learning regularization methods lack scalability. This makes it difficult to test these methods on current state-of-the-art models such as ViT (Dosovitskiy et al. 2021), with 1.843 billion parameters, or slightly older models such as EfficientNet-B7 (Tan and Le 2019), with 813 layers.

Conclusion and Future Work
Improving the performance of Federated Learning on heterogeneous data is a widely researched topic. However, many previous works have incorrectly suggested that regularizing every layer of a neural network during local training is the best method to increase performance. We propose FedCKA, built on the most up-to-date understanding of neural networks. By regularizing naturally similar, but not naturally dissimilar, layers during local training, performance improves beyond previous works. We also show that FedCKA is the only existing regularization method with adequate scalability when trained with a moderately sized model.
FedCKA shows that properly regularizing important layers improves the performance of Federated Learning on heterogeneous data. However, standardizing the comparison of neural networks remains an important step toward a deeper understanding of neural networks. Moreover, there are open questions about the accuracy of CKA in measuring similarity in models such as Transformers or Graph Neural Networks. We leave these topics for future work.