Article

Efficient Federated Learning Method FedLayerPrune Based on Layer Adaptive Pruning

Key Laboratory of Minzu Languages and Cultures Intelligent Information Processing, Gansu Province, Northwest Minzu University, Lanzhou 730030, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(5), 1049; https://doi.org/10.3390/electronics15051049
Submission received: 1 February 2026 / Revised: 19 February 2026 / Accepted: 23 February 2026 / Published: 2 March 2026
(This article belongs to the Section Computer Science & Engineering)

Abstract

As a privacy-preserving distributed machine learning paradigm, federated learning (FL) faces serious communication bottlenecks in practical deployment. In this paper, we propose FedLayerPrune, a communication-efficient federated learning method that integrates three synergistic components: (i) a layer-adaptive pruning strategy that dynamically allocates pruning rates based on layer sensitivity and network depth; (ii) a heterogeneity-aware aggregation mechanism that combines sample-size weighted averaging with mask consensus voting to enhance robustness under non-IID data distributions; and (iii) a dynamic pruning rate scheduler that progressively increases compression intensity across training rounds. Unlike existing approaches that apply uniform pruning or consider these techniques in isolation, FedLayerPrune achieves a principled coordination among layer-wise importance evaluation, temporal pruning scheduling, and heterogeneous model aggregation. Extensive experiments on CIFAR-10, MNIST, and Fashion-MNIST demonstrate that FedLayerPrune reduces communication costs by up to 68.3% compared with standard FedAvg, while maintaining model accuracy within a 2% margin. Moreover, our method exhibits stronger robustness and faster convergence under severe non-IID data distributions. These results suggest that FedLayerPrune provides a practical and effective solution for deploying federated learning in resource-constrained edge computing environments.

1. Introduction

With the rapid proliferation of Internet of Things (IoT) devices, mobile computing platforms, and edge intelligence, the demand for distributed machine learning services has grown substantially. In this context, achieving efficient collaborative model training while preserving user data privacy has become a critical concern in both academia and industry. Federated learning (FL) [1] addresses this challenge by enabling multiple clients to collaboratively train a shared global model through exchanging model parameters or gradient updates, without directly sharing raw data. This paradigm effectively mitigates privacy risks associated with centralized data aggregation while leveraging distributed computational resources, thus offering significant potential in applications such as healthcare, financial services, and smart mobile terminals [2,3].
Despite its advantages in privacy preservation and distributed modeling, federated learning faces substantial challenges in practical deployment, among which communication inefficiency remains one of the most prominent bottlenecks [2]. In a typical FL framework, clients must frequently exchange large-scale model parameters with a central server. For modern deep neural networks with millions or even billions of parameters, this communication overhead can cause severe training delays and performance degradation, particularly in bandwidth-limited or unstable network environments. Therefore, reducing communication costs while preserving model performance has become a central research objective in federated learning [2,3].
Model pruning [4,5,6] has emerged as a promising approach for communication reduction, as it directly decreases the number of transmitted parameters by removing redundant or less important weights. Recent works have explored various pruning strategies in federated settings, including dynamic sparse training [7], adaptive model pruning [8], memory-efficient pruning [9], and lottery ticket-based approaches [10]. However, existing methods still exhibit the following limitations:
First, most approaches adopt a uniform global pruning rate across all network layers, neglecting the well-established observation that different layers exhibit fundamentally different levels of redundancy and sensitivity [11]. For instance, shallow convolutional layers responsible for low-level feature extraction are typically far more sensitive to pruning than fully connected classification layers.
Second, current pruning methods in FL generally lack adaptation to client heterogeneity. In practical deployments, clients often differ significantly in data distribution, computational capabilities, and network conditions. A uniform pruning and aggregation strategy fails to account for these disparities, leading to suboptimal global model performance, especially under non-IID data distributions.
Third, static pruning strategies that apply a fixed pruning rate throughout training cannot adapt to different training phases. Aggressive pruning in early rounds may disrupt model convergence, while conservative pruning in later rounds sacrifices potential communication savings.
While individual techniques addressing some of these issues have been explored in prior work (e.g., layer-wise pruning [12], dynamic sparsity [7], heterogeneous aggregation [13]), no existing method integrates all three aspects into a unified and coordinated framework. In this paper, we propose FedLayerPrune, a communication-efficient federated learning method that achieves a synergistic integration of layer-adaptive pruning, dynamic pruning scheduling, and heterogeneity-aware aggregation. Our key insight is that these three components are not merely stacked optimizations: they interact in a complementary manner, as demonstrated by our ablation study, in which removing all three components together degrades accuracy by 3.4% and the individual component contributions overlap rather than simply accumulate (Section 4.2.5). The core contributions of this work are as follows:
(1)
Layer-adaptive pruning strategy: We propose a principled mechanism that dynamically allocates pruning rates based on layer type, network depth, and gradient-based importance scores. Specifically, sensitive convolutional layers receive conservative pruning, while redundant fully connected layers undergo more aggressive compression, thereby preserving critical feature extraction capabilities.
(2)
Heterogeneity-aware aggregation mechanism: We introduce a dual mechanism combining sample-size weighted averaging with mask consensus voting for global model aggregation. This design explicitly accounts for data distribution heterogeneity across clients, enhancing model robustness and generalization under non-IID settings.
(3)
Dynamic pruning rate scheduling: We design a progressive pruning scheduler that coordinates with the training process—maintaining low pruning rates during early convergence-critical phases and gradually increasing compression intensity in later rounds—achieving an effective balance between communication efficiency and model convergence quality.
(4)
Comprehensive experimental evaluation: Systematic experiments on CIFAR-10, MNIST, and Fashion-MNIST datasets demonstrate that FedLayerPrune achieves up to 68.3% communication reduction compared with FedAvg while keeping accuracy loss within 2%, and exhibits superior robustness under severe non-IID conditions compared to existing baselines.
The remainder of this paper is organized as follows: Section 2 reviews related work and positions our contributions relative to existing methods; Section 3 presents the detailed design of FedLayerPrune; Section 4 describes the experimental setup and analyzes results; Section 5 discusses the advantages and limitations of the approach; Section 6 concludes the paper and outlines future research directions.

2. Related Works

2.1. Federated Learning Optimization

Federated learning was first proposed by McMahan et al. [1], whose core algorithm FedAvg enables distributed model training without centralized data collection through local stochastic gradient descent (SGD) and periodic parameter averaging. While FedAvg provides an effective foundational framework with inherent privacy guarantees, its performance degrades under system heterogeneity and non-IID data distributions, prompting substantial follow-up research.
To improve FL performance under heterogeneous conditions, numerous optimization algorithms have been proposed. FedProx [14] introduces a proximal regularization term in the local objective to constrain client updates close to the global model, alleviating the update divergence caused by heterogeneous computing capabilities. SCAFFOLD [15] employs control variates to correct client drift, thereby accelerating convergence. FedNova [16] proposes a normalized averaging strategy to address inconsistent local training steps across clients, reducing the impact of local optimization discrepancies on the global model. These methods primarily enhance FL stability and convergence from an optimization perspective, but offer limited direct improvement in communication efficiency.

2.2. Model Compression Techniques

With the growing scale of deep neural networks, model compression has become essential for addressing computational and storage bottlenecks. Common compression techniques include parameter quantization [17], knowledge distillation [18,19], and model pruning [4,5]. Among these, pruning has received widespread attention for its ability to significantly reduce model size while preserving performance by removing redundant weights or structures.
Based on pruning granularity, existing methods fall into two main categories:
  • Unstructured pruning: Removes individual weight connections, achieving high compression ratios but producing sparse models that require specialized hardware or sparse computation libraries for acceleration. Dynamic sparse training methods such as RigL [6] have been proposed to maintain performance during the pruning process.
  • Structured pruning: Operates at the level of channels, filters, or entire layers, preserving network regularity and enabling efficient deployment on general-purpose hardware platforms.
In federated learning scenarios, structured pruning is often more advantageous, as it simultaneously reduces both the communication payload and the local computational overhead at client devices. Recent advances in few-shot learning [20] have also demonstrated the importance of efficient feature representation, which motivates our layer-aware approach to preserving critical features during pruning.

2.3. Communication Optimization in Federated Learning

Communication efficiency remains a primary bottleneck for practical FL deployment. Researchers have proposed optimization strategies from multiple perspectives:
  • Gradient compression: Techniques including quantization [17,21], sparsification, and truncation reduce the communication burden by compressing gradient or parameter updates. Advanced approaches such as adaptive gradient quantization [22,23,24] and predictive coding [25] have been proposed to further improve communication efficiency.
  • Model compression: Directly simplifies the transmitted model to reduce per-round data volume.
The integration of pruning into federated learning has become an active research direction. PruneFL [26] proposes an adaptive pruning strategy but applies a globally uniform pruning rate without exploiting inter-layer differences. HeteroFL [13] enables clients to train sub-models of varying sizes to accommodate device heterogeneity, but introduces additional system complexity. More recently, FedDST [7] investigates dynamic sparse training in federated settings, FedMP [8] proposes adaptive model pruning for computation and communication efficiency, FedMef [9] focuses on memory-efficient dynamic pruning, FLASH [27] addresses concept drift in heterogeneous edge networks, FedRolex [28] enables model-heterogeneous FL through rolling sub-model extraction, and LotteryFL [10] applies the lottery ticket hypothesis to achieve personalized and communication-efficient federated learning.

2.4. Non-IID Data Handling

Handling non-independently and identically distributed (non-IID) data across clients is a fundamental challenge in federated learning, as data heterogeneity can significantly degrade model performance and slow convergence. FedDC [29] addresses non-IID data through local drift decoupling and correction. Zhang et al. [30] provide a comprehensive review covering taxonomy, metrics, and methods for non-IID FL. FedBN [31] proposes local batch normalization to handle non-IID features, allowing each client to maintain its own batch normalization statistics while sharing other model parameters.

2.5. Layer-Wise Techniques

Layer-wise approaches have gained attention in FL for their ability to exploit structural differences across network layers. Wang et al. [12] propose FedLP-Q, combining layer-wise pruning with quantization for efficient FL. Ma et al. [32] investigate layer-wise model aggregation for personalized FL, demonstrating that different layers contribute differently to personalization. These works motivate our layer-adaptive pruning strategy, which dynamically adjusts pruning rates based on layer characteristics.

2.6. Summary and Positioning of FedLayerPrune

As reviewed in recent surveys [2,3,11], existing research has made significant progress in FL optimization, model compression, and communication efficiency. However, current methods typically address individual aspects in isolation: FL optimization methods (FedProx, SCAFFOLD, and FedNova) focus on convergence and stability with limited communication improvement; pruning-based FL methods either apply uniform pruning rates (PruneFL), focus solely on dynamic sparsity (FedDST), or address model heterogeneity without layer-adaptive mechanisms (HeteroFL and FedRolex).
To clearly delineate the novelty of FedLayerPrune relative to existing approaches, Table 1 provides a structured comparison across key design dimensions.
As shown in Table 1, FedLayerPrune is the first method to simultaneously incorporate layer-adaptive pruning, dynamic pruning scheduling, heterogeneity-aware aggregation, and mask regrowth within a unified framework. While individual components draw on ideas explored in prior work, our key contribution lies in their synergistic integration: the coordinated design of these components yields complementary performance gains that no single component provides on its own, as empirically validated in our ablation study (Section 4.2.5). Specifically, FedLayerPrune addresses the limitations of existing methods by (i) assigning differentiated pruning rates based on layer sensitivity rather than applying uniform compression; (ii) progressively scheduling pruning intensity to respect training dynamics; and (iii) incorporating client data heterogeneity directly into the aggregation process through mask consensus voting. This integrated design provides the theoretical motivation for the method presented in Section 3.

3. Our Method: FedLayerPrune

3.1. Problem Definition

Assume the system consists of $K$ clients, where client $k$ holds a local dataset $D_k$ with sample size $n_k = |D_k|$, and the total number of samples is $n = \sum_{k=1}^{K} n_k$. The global optimization objective is the weighted empirical risk minimization:
$$\min_{w \in \mathbb{R}^d} F(w) \triangleq \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) \triangleq \mathbb{E}_{(x,y) \sim D_k}\big[\ell(w; x, y)\big]. \tag{1}$$
Here, $w$ denotes the model parameter vector, $d = |w|$ is the parameter dimension, and $\ell$ is a differentiable loss function.
The parameters are grouped by network layers: $w = \{w^{(l)}\}_{l \in \mathcal{L}}$, where the dimension of layer $l$ is $d_l$ and $\sum_{l \in \mathcal{L}} d_l = d$. Define the layer-wise retention ratio $r_l \in [0, 1]$ (the pruning rate is $p_l = 1 - r_l$) and a binary mask $m^{(l)} \in \{0, 1\}^{d_l}$. The retained parameters are written as $w^{(l)} \odot m^{(l)}$. Let the global mask be $m = \{m^{(l)}\}_l$, whose 0/1 support sets are denoted by $\mathrm{supp}_0(m)$ and $\mathrm{supp}_1(m)$, respectively.
In traditional FedAvg, each round of uplink and downlink requires transmitting the full set of parameters, leading to a communication complexity of $O(d)$. Our goal is to learn the layer-wise retention ratios $\{r_l\}$ and the global mask $m$ such that the expected communication cost per round
$$\mathbb{E}[\mathrm{Comm}] \approx \underbrace{\sum_{l} r_l d_l}_{\text{parameter payload}} + \underbrace{H(m)}_{\text{mask encoding cost}} \tag{2}$$
is significantly lower than $d$, while maintaining model accuracy. Here, $H(m)$ represents the number of bits required to encode the mask using run-length or sparse index coding.
Equivalent Constrained Form. The problem can be written as a constrained optimization with communication budget $B$:
$$\min_{w, m} F(w \odot m) \quad \text{s.t.} \quad \sum_{l} r_l d_l + H(m) \le B, \quad r_l \in [0, 1], \tag{3}$$
or equivalently, as a Lagrangian penalty form:
$$\min_{w, m} F(w \odot m) + \lambda\, \Phi(m), \tag{4}$$
where $\Phi$ is an approximation of the communication cost (e.g., an $\ell_0$ or group-sparsity surrogate).
Remark on the formulation–algorithm relationship. We emphasize that the constrained optimization in Equations (3) and (4) serves as a principled motivation for the algorithm design rather than a problem that is directly solved via Lagrangian optimization. In practice, jointly optimizing the model parameters $w$ and the discrete binary masks $m$ is NP-hard in general. Instead, FedLayerPrune adopts a computationally tractable heuristic that decomposes the problem into three coordinated sub-procedures: (i) importance-based mask generation guided by Fisher information scores, which implicitly minimizes the loss increase due to pruning; (ii) layer-adaptive pruning rate allocation, which respects the per-layer communication budget term $r_l d_l$; and (iii) a progressive scheduling strategy that gradually tightens the effective budget $B$ over training rounds. While this approach does not guarantee convergence to the global optimum of the Lagrangian form, our extensive experiments (Section 4) demonstrate that this heuristic decomposition achieves highly competitive performance in practice, consistent with the success of similar relaxation strategies in the pruning literature [4,6].
Table 2 lists the parameter symbols used in this method and their corresponding meanings.

3.2. Design of FedLayerPrune

The core idea of FedLayerPrune is to dynamically adjust the pruning strategy based on both the structural characteristics of individual network layers and the temporal progression of training.
A round of federated training in FedLayerPrune consists of four stages:
  • Local training: Each selected client performs E local epochs of SGD on its private dataset D k .
  • Importance evaluation: Each client estimates parameter importance scores using Fisher information approximation (computed via a single additional backward pass on a mini-batch), with exponential moving average smoothing for stability.
  • Layer-adaptive pruning: Binary masks and sparse parameters are generated according to layer-specific sensitivity coefficients and the temporal pruning schedule.
  • Heterogeneity-aware aggregation: The server performs sample-size weighted aggregation on the uploaded sparse models and applies mask consensus voting to produce the global mask, followed by periodic regrowth of pruned connections.
The algorithm workflow is illustrated in Figure 1.

3.2.1. Layer Adaptive Pruning Strategy

We define a decomposable retention ratio for each layer:
$$r_l = g\Big(\underbrace{\alpha_l}_{\text{layer sensitivity}},\; \underbrace{\beta_t}_{\text{time scheduling}},\; \underbrace{\gamma_l}_{\text{statistical strength}},\; \underbrace{\rho_{\mathrm{base}}}_{\text{base rate}}\Big), \qquad p_l = 1 - r_l. \tag{5}$$
A simple yet effective choice is a multiplicative form:
$$p_l = p_{\mathrm{base}} \cdot g(\alpha_l, \beta_t), \qquad r_l = 1 - p_l, \qquad p_l \in [0, 1]. \tag{6}$$
Layer sensitivity $\alpha_l$. Different weights are assigned for different structures:
  • Convolutional layers (Conv): $\alpha_l = 0.6$ (conservative);
  • Batch normalization (BN): $\alpha_l = 0$ (no pruning);
  • Fully connected layers (FC): $\alpha_l = 1.1$ (aggressive);
  • By depth: shallow layers $\times 0.7$, deep layers $\times 1.2$ (these depth factors can be combined with the type-based values).
Time scheduling $\beta_t$. A piecewise warm-up strategy is adopted to balance early stability and late compression:
$$\beta_t = \begin{cases} 1.0, & t/T < 0.3, \\ 1.0 + 0.5 \cdot \dfrac{t/T - 0.3}{0.5}, & 0.3 \le t/T \le 0.8, \\ 1.5, & t/T > 0.8. \end{cases} \tag{7}$$
Statistical strength $\gamma_l$. This can be chosen as a normalized measure of the layer-wise Hessian trace or gradient variance:
$$\gamma_l = \frac{\mathrm{tr}\{\hat{H}^{(l)}\}}{\sum_j \mathrm{tr}\{\hat{H}^{(j)}\}} \quad \text{or} \quad \gamma_l = \frac{\mathrm{Var}\{w^{(l)}\}}{\sum_j \mathrm{Var}\{w^{(j)}\}}, \tag{8}$$
and used to fine-tune $\alpha_l$ (e.g., $\alpha_l \leftarrow \alpha_l / (\epsilon + \gamma_l)$).
Structured and unstructured hybrid. To balance deployability on general hardware and a high compression ratio, a hybrid pruning strategy is adopted (a code sketch follows this list):
  • Structured channel pruning: Sort convolutional channels/filters $w_c^{(l)}$ by $\ell_2$ norm, and retain the top $r_l C_l$ channels;
  • Unstructured sparsity: Within the retained channels, apply fine-grained 0/1 masks according to weight magnitude or Fisher scores.
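To make the retention-ratio computation and the hybrid masking concrete, the following PyTorch sketch combines the multiplicative schedule of Equation (6) with structured channel selection and in-channel magnitude masking. The sensitivity values follow Section 3.2.1, while the function names, the choice $g(\alpha, \beta) = \alpha \cdot \beta$, the clamping to a valid probability, and the use of magnitude (rather than Fisher) scores inside channels are illustrative assumptions, not the authors' released implementation.

```python
import torch

def beta_schedule(t: int, T: int) -> float:
    """Piecewise warm-up factor beta_t of Equation (7)."""
    frac = t / T
    if frac < 0.3:
        return 1.0
    if frac <= 0.8:
        return 1.0 + 0.5 * (frac - 0.3) / 0.5
    return 1.5

def pruning_rate(layer_type: str, t: int, T: int, p_base: float = 0.2) -> float:
    """Multiplicative form of Equation (6) with g(alpha, beta) = alpha * beta;
    Section 3.3 additionally caps this value at p_max."""
    alpha = {"conv": 0.6, "bn": 0.0, "fc": 1.1}[layer_type]  # layer sensitivity
    return min(p_base * alpha * beta_schedule(t, T), 1.0)

def hybrid_prune_conv(weight: torch.Tensor, r_l: float) -> torch.Tensor:
    """Hybrid 0/1 mask for a conv weight of shape (C_out, C_in, kH, kW):
    keep the top r_l fraction of filters by L2 norm, then keep the top r_l
    fraction of weights (by magnitude) inside the retained filters."""
    c_out = weight.shape[0]
    keep = max(1, int(round(r_l * c_out)))

    # Structured step: rank filters by their L2 norm.
    filter_norms = weight.view(c_out, -1).norm(p=2, dim=1)
    kept_idx = torch.topk(filter_norms, keep).indices

    # Unstructured step: magnitude threshold inside the kept filters.
    mask = torch.zeros_like(weight)
    kept_w = weight[kept_idx]
    k = max(1, int(round(r_l * kept_w.numel())))
    thr = kept_w.abs().flatten().kthvalue(kept_w.numel() - k + 1).values
    mask[kept_idx] = (kept_w.abs() >= thr).float()
    return mask

# Example: a conv layer at round t = 25 of T = 30 (late phase, beta_t = 1.5).
p = pruning_rate("conv", t=25, T=30)                  # 0.2 * 0.6 * 1.5 = 0.18
mask = hybrid_prune_conv(torch.randn(64, 32, 3, 3), r_l=1.0 - p)
print(p, mask.mean().item())
```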

3.2.2. Parameter Importance Assessment

The importance score $I_i$ of a parameter $w_i$ is computed using a diagonal approximation of the Fisher information matrix:
$$I_i = \mathbb{E}_{(x,y) \sim D_k}\left[\left(\frac{\partial \ell(w; x, y)}{\partial w_i}\right)^{2}\right]. \tag{9}$$
Computation timing and implementation. The importance scores are computed once per round at each client, immediately after the E local training epochs are completed. In practice, a single mini-batch is used for approximation, incurring a cost equivalent to one additional backward pass. To reduce noise and ensure stability across rounds, we apply an exponential moving average (EMA) update:
$$I_i^{(\mathrm{EMA})} \leftarrow \rho\, I_i^{(\mathrm{EMA})} + (1 - \rho)\, \hat{I}_i, \qquad \rho \in [0, 1]. \tag{10}$$
Alternative scorers. If the computational budget is limited, one may use magnitude-based scoring $I_i = |w_i|$; if structured pruning is of interest, one may use group norms $\|w_c^{(l)}\|_2$.
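As a concrete illustration of Equations (9) and (10), the following PyTorch sketch estimates diagonal Fisher scores from a single mini-batch and smooths them with an EMA across rounds. The cross-entropy loss, the toy model, and the dictionary-based bookkeeping are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fisher_scores(model, batch, ema_scores=None, rho=0.9):
    """Diagonal Fisher importance: squared per-parameter gradients on one
    mini-batch (Eq. (9)), smoothed with an EMA across rounds (Eq. (10))."""
    inputs, targets = batch
    model.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss.backward()                      # the single extra backward pass

    scores = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        fisher = p.grad.detach() ** 2    # squared gradient ~ diagonal Fisher
        if ema_scores is not None and name in ema_scores:
            fisher = rho * ema_scores[name] + (1.0 - rho) * fisher
        scores[name] = fisher
    return scores

# Usage with a toy model; the scores would feed the layer-adaptive masks.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
batch = (torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))
ema = fisher_scores(model, batch)          # first round: no smoothing
ema = fisher_scores(model, batch, ema)     # later rounds: EMA-smoothed
```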

3.2.3. Heterogeneity-Aware Aggregation

Let $S_t$ denote the set of participating clients in round $t$. Client $k$ produces sparse parameters $w_k^{t+1} \odot m_k$. A key challenge in aggregating pruned models is that different clients may prune different parameters, resulting in heterogeneous sparse structures. Standard FedAvg simply averages all parameters, which can dilute important weights with zeros from clients that pruned those parameters.
To address this, FedLayerPrune adopts a dual mechanism combining sample-size weighted aggregation with mask consensus voting:
$$w_i^{(t+1)} = \sum_{k \in S_t} \underbrace{\frac{n_k}{\sum_{j \in S_t} n_j}}_{\lambda_k}\, m_{k,i}\, w_{k,i}^{(t+1)}, \tag{11}$$
$$m_i^{(\mathrm{global})} = \mathbb{1}\!\left[\sum_{k \in S_t} \lambda_k\, m_{k,i} > \tau\right], \qquad \tau \in (0, 1). \tag{12}$$
The novelty of this aggregation scheme beyond standard weighted averaging lies in two aspects. First, the parameter-level aggregation in Equation (11) naturally down-weights contributions from clients that pruned a given parameter (since $m_{k,i} = 0$ for those clients), effectively performing an importance-weighted fusion that respects each client’s local pruning decision. Second, the mask consensus voting in Equation (12) determines the global sparse structure through a democratic process: a parameter is retained in the global model only if sufficient weighted evidence from participating clients supports its importance. This mechanism prevents any single client’s pruning decision from dominating the global structure, which is particularly beneficial under non-IID settings where different clients may have divergent views on parameter importance.
The threshold τ controls the conservatism of the global mask. A lower τ (e.g., τ = 0.1 ) retains parameters that even a minority of clients consider important, yielding denser global models; a higher τ (e.g., τ = 0.7 ) requires broad consensus, producing sparser models. In our experiments, we set τ = 0.3 by default, which balances between preserving useful parameters discovered by individual clients under non-IID conditions and maintaining communication efficiency. A sensitivity analysis of τ is provided in Section 4.2.5.
Mask alignment and regrowth. To prevent premature structural freezing—where potentially important connections are permanently pruned—the server performs a periodic regrowth step every $R$ rounds. Specifically, among the currently pruned parameters ($m_i^{(\mathrm{global})} = 0$), the top-$\zeta$ fraction ranked by global importance scores are reactivated (set to 1), following the Lottery Ticket principle [5]. Reactivated parameters inherit the current global model weights rather than being re-initialized, preserving learned representations. In all experiments, we use $R = 5$ rounds and $\zeta = 0.05$ (i.e., 5% of pruned parameters are regrown per cycle). This regrowth mechanism has a negligible impact on communication cost since it only affects the global mask broadcast and does not increase the uplink transmission volume.
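The following sketch illustrates the aggregation of Equations (11) and (12) together with the periodic regrowth step. It assumes each client uploads its sample count, masked parameters, and binary mask as tensor dictionaries, and that the server holds a dictionary of global importance scores; the payload format and helper names are assumptions made for illustration.

```python
import torch

def aggregate(client_payloads, tau=0.3):
    """Sample-size weighted aggregation (Eq. (11)) and mask consensus voting
    (Eq. (12)). Each payload is (n_k, params, mask): dicts of tensors keyed
    by parameter name, with zeros already applied where mask == 0."""
    total = sum(n_k for n_k, _, _ in client_payloads)
    names = client_payloads[0][1].keys()

    global_params, global_mask = {}, {}
    for name in names:
        weighted_sum, vote = None, None
        for n_k, params, mask in client_payloads:
            lam = n_k / total
            contrib = lam * mask[name] * params[name]
            ballot = lam * mask[name]
            weighted_sum = contrib if weighted_sum is None else weighted_sum + contrib
            vote = ballot if vote is None else vote + ballot
        global_params[name] = weighted_sum
        global_mask[name] = (vote > tau).float()   # consensus voting threshold
    return global_params, global_mask

def regrow(global_mask, importance, zeta=0.05):
    """Every R rounds, reactivate the top-zeta fraction of pruned positions
    ranked by global importance; reactivated entries keep current weights."""
    for name, mask in global_mask.items():
        pruned = (mask == 0)
        n_regrow = int(zeta * pruned.sum().item())
        if n_regrow == 0:
            continue
        scores = importance[name].clone()
        scores[~pruned] = float("-inf")            # only rank pruned positions
        idx = torch.topk(scores.flatten(), n_regrow).indices
        mask.view(-1)[idx] = 1.0
    return global_mask
```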

3.3. Algorithm Pseudocode

Before presenting the full algorithm, we unify the pruning rate update rule. The effective pruning rate at round $t$ for layer $l$ is computed as
$$p_l^{(t)} = \min\!\big(p_{\mathrm{base}} \cdot \alpha_l \cdot \beta_t,\; p_{\max}\big), \qquad r_l^{(t)} = 1 - p_l^{(t)}, \tag{13}$$
where $p_{\mathrm{base}}$ is the base pruning rate, $\alpha_l$ is the layer sensitivity coefficient (Section 3.2.1), $\beta_t$ is the temporal scheduling factor (Equation (7)), and $p_{\max}$ is the maximum allowable pruning rate. This formulation ensures that the effective pruning rate increases progressively over training rounds (controlled by $\beta_t$) while respecting layer-specific sensitivities (controlled by $\alpha_l$) and never exceeding $p_{\max}$. The full algorithm of FedLayerPrune (Algorithm 1) is presented as follows:
Algorithm 1: FedLayerPrune

3.4. Complexity and Communication Cost Analysis

Local computation. The importance estimation employs a single mini-batch Fisher approximation, incurring an additional cost of approximately one backward pass ($O(d)$). Layer-wise sorting and thresholding require $O(d \log d)$ time, which can be parallelized to $\sum_l d_l \log d_l$ across layers.
Communication cost specification. We now provide a precise accounting of the per-round communication cost. In the uplink (client → server), each client transmits: (i) the non-zero parameters $w \odot m$, where each parameter is encoded in float32 (4 bytes), yielding a payload of $4 \sum_l r_l d_l$ bytes; and (ii) the binary mask $m$, encoded using 1 bit per parameter via bitmap encoding, contributing $d/8$ bytes. For structured channel pruning, the mask can be more efficiently represented using block-level indices, reducing the mask overhead to $O(L)$, where $L$ is the number of layers. In the downlink (server → client), the server broadcasts the aggregated global model and global mask using the same encoding scheme. The total per-round communication cost is therefore
$$\mathrm{Comm}_{\mathrm{round}} = 2 \times \left(4 \sum_l r_l d_l + \frac{d}{8}\right) \text{ bytes}, \tag{14}$$
where the factor of 2 accounts for both uplink and downlink transmissions. In practice, the mask encoding term $H(m) = d/8$ bytes is negligible compared to the parameter payload (typically <1% of total communication when using float32), so the effective communication reduction is dominated by the average sparsity $\bar{p} = 1 - \sum_l r_l d_l / d$.
Clarification on $H(m)$ in experiments. In our experimental evaluation (Section 4), the communication cost reported in Table 3 is computed using the exact formula above with bitmap mask encoding. We do not employ run-length encoding or entropy coding in the current implementation, though such techniques could further reduce the mask overhead for highly structured sparsity patterns.
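A minimal sketch of the per-round accounting of Equation (14), assuming float32 parameters and a 1-bit bitmap mask; the helper name is illustrative, and the example retention ratio is chosen to match the ResNet-18 breakdown reported in Section 4.2.3.

```python
def comm_cost_bytes(layer_dims, retention_ratios, with_mask=True):
    """Per-round uplink + downlink cost of Eq. (14): float32 payload for the
    retained parameters plus an optional 1-bit-per-parameter bitmap mask."""
    d = sum(layer_dims)
    payload = 4.0 * sum(r * dl for r, dl in zip(retention_ratios, layer_dims))
    mask = d / 8.0 if with_mask else 0.0
    return 2.0 * (payload + mask)

# Example with a ResNet-18-sized model (11.2M parameters, 1 MB = 1e6 bytes).
dims = [11_200_000]
print(comm_cost_bytes(dims, [1.0], with_mask=False) / 1e6)   # ~89.6 MB, dense FedAvg
print(comm_cost_bytes(dims, [0.473]) / 1e6)                   # ~45.2 MB at ~52.7% sparsity
```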
Server aggregation. The weighted aggregation and mask consensus voting are both linear in the number of non-zero parameters, $O(\|m\|_0)$. The periodic regrowth step (every $R$ rounds) requires an additional linear scan and partial sort of complexity $O(d)$, which is amortized over $R$ rounds and thus contributes negligible overhead.

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets and Data Heterogeneity Modeling

To comprehensively evaluate the performance and generalization capability of the FedLayerPrune algorithm, we conduct experimental validation on three representative image classification benchmark datasets:
  • CIFAR-10 [33]: This dataset contains 60,000 RGB color images of size 32 × 32 pixels, covering 10 mutually exclusive object categories (airplanes, automobiles, birds, etc.). We follow the standard split, using 50,000 images for training and 10,000 images for testing. With moderate image content complexity, CIFAR-10 is widely used to evaluate the performance of deep neural networks in federated learning scenarios.
  • MNIST: As a classic handwritten digit recognition dataset, MNIST contains 70,000 grayscale images of size 28 × 28 pixels, covering 10 digit categories from 0 to 9. The standard configuration uses 60,000 images for training and 10,000 images for testing. Although MNIST is relatively simple, it holds significant value for validating the fundamental effectiveness of algorithms.
  • Fashion-MNIST [34]: This dataset maintains the same image specifications and quantity distribution as MNIST, but replaces the recognition targets with 10 categories of clothing items (T-shirts, trousers, sweaters, etc.). Fashion-MNIST offers higher complexity than MNIST, providing a more challenging classification task that helps verify the adaptability of algorithms across different difficulty levels.
Data Heterogeneity Modeling: One of the core challenges in federated learning is handling non-independently and identically distributed (non-IID) data. To systematically evaluate the robustness of algorithms under varying degrees of heterogeneity, we employ the Dirichlet distribution $\mathrm{Dir}(\alpha)$ [30,31] to simulate data heterogeneity in real-world scenarios. Specifically, for $K$ clients and $C$ classes, we first sample a probability vector $p_k \sim \mathrm{Dir}(\alpha)$ for each client $k$, then distribute data of each class proportionally according to $p_k$. The parameter $\alpha$ controls the concentration of the distribution: smaller $\alpha$ values lead to more unbalanced data distributions and stronger heterogeneity. We set $\alpha \in \{0.1, 0.5, 1.0\}$ to simulate strong, moderate, and weak non-IID scenarios, respectively, where $\alpha = 0.1$ represents extreme heterogeneity (each client may only possess data from a few classes), while $\alpha = 1.0$ approaches a uniform distribution. This modeling approach reflects the range of data distribution challenges that algorithms may encounter in practical deployment.
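The sketch below shows one common way to realize a Dirichlet-based partition: for each class, proportions over clients are drawn from Dir(α) and that class's samples are split accordingly. This per-class formulation is a standard implementation variant and an assumption here; the text above describes sampling a per-client probability vector, which induces a similar degree of label skew.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=10, alpha=0.5, seed=0):
    """Split sample indices across clients using Dir(alpha) class proportions;
    smaller alpha yields stronger label-distribution skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]

    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, shard in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices

# Example: strongly non-IID split (alpha = 0.1) of 10-class labels.
fake_labels = np.repeat(np.arange(10), 600)
parts = dirichlet_partition(fake_labels, num_clients=10, alpha=0.1)
print([len(p) for p in parts])
```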

4.1.2. Neural Network Architecture Configuration

We carefully selected three representative neural network architectures to validate the universality of FedLayerPrune across different model structures:
  • ResNet-18 [35] (Adapted Version): For the 32 × 32 input size of CIFAR-10, we adapted the standard ResNet-18 with modifications including adjusting the stride of the initial convolutional layer from 2 to 1, removing the first max pooling layer, and adjusting the channel configuration of residual blocks. The modified model contains 4 residual block groups with 2 residual blocks per group, totaling approximately 11.2 M parameters. This architecture represents the widely used residual network family in modern deep learning, whose skip connection characteristics pose unique challenges for pruning algorithms.
  • Convolutional Neural Network (CNN): For MNIST and Fashion-MNIST datasets, we designed a lightweight yet effective CNN architecture. The network contains two convolutional layers (with 32 and 64 filters, respectively, kernel size 5 × 5 ), each followed by ReLU activation and 2 × 2 max pooling, followed by two fully connected layers (128 hidden units and 10 output units). The total parameter count is approximately 1.2 M. This relatively simple architecture facilitates analyzing the effects of layer-wise pruning strategies. For comparison with lightweight mobile architectures, we also consider the design principles of MobileNets [36] in our pruning strategy.
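For reference, a PyTorch sketch of the lightweight CNN described above is given below. The padding choice and the resulting flattened dimension are assumptions needed to make the sketch runnable for 28 × 28 inputs, so the printed parameter count may differ from the ~1.2 M reported above.

```python
import torch
import torch.nn as nn

class LightCNN(nn.Module):
    """Two 5x5 conv blocks (32 and 64 filters) with ReLU + 2x2 max pooling,
    followed by a 128-unit hidden FC layer and a 10-way classifier."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = LightCNN()
print(sum(p.numel() for p in model.parameters()))    # parameter count
print(model(torch.randn(2, 1, 28, 28)).shape)         # torch.Size([2, 10])
```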

4.1.3. Baseline Methods and Comparison Algorithms

To comprehensively evaluate the performance advantages of FedLayerPrune, we selected the following representative baseline methods for comparison:
  • FedAvg [1] (Federated Averaging): The classic federated learning algorithm proposed by McMahan et al., without any model compression mechanism, serving as the performance upper bound reference baseline.
  • FixedPrune-50 and FixedPrune-70: Static pruning strategies with fixed pruning rates (50% and 70%, respectively). These methods apply identical pruning rates to all layers without considering inter-layer differences or training dynamics, enabling evaluation of the impact of uniform pruning intensities on model performance.
  • FedPrune [26]: A dynamic pruning method that adjusts global pruning rates based on training progress but does not differentiate the importance variations among different network layers, forming a direct contrast with our layer-adaptive strategy.
  • FedDST [7]: A representative dynamic sparse training method for federated learning that employs sparse-to-sparse training with periodic topology updates. FedDST serves as a strong baseline for evaluating the benefits of layer-adaptive pruning over uniform dynamic sparsification.
  • FedProx [14]: A federated optimization method that introduces proximal regularization to handle client heterogeneity. While not a pruning method, FedProx serves as a reference for evaluating robustness under non-IID conditions.
Remark on baseline selection. We note that several recent methods discussed in our related work—including FedMef [9], FedRolex [28], and LotteryFL [10]—are not included as experimental baselines. FedMef’s official implementation was not publicly available at the time of our experiments, and FedRolex addresses model-heterogeneous FL (where clients train models of different architectures), which differs from our homogeneous-model setting. LotteryFL focuses on personalized FL with per-client lottery tickets, making direct comparisons on global model accuracy less meaningful. We include FedDST as the most representative and directly comparable dynamic sparse training baseline, and provide a qualitative comparison with other methods in Table 1.

4.1.4. Hyperparameter Configuration and Experimental Environment

Federated learning configuration. The total number of clients is set to $K = 10$ with client participation rate $C = 1.0$ (full participation). We adopt this controlled setting to enable precise analysis of algorithm behavior by eliminating stochastic effects from client sampling, following common practice in federated pruning studies [7,26]. To evaluate scalability under more realistic conditions, we additionally conduct experiments with $K = 50$ and partial participation $C = 0.3$ in Section 4.2.7. The number of local training epochs is set to $E = 3$ to balance local computation and communication frequency. The number of global communication rounds is $T = 30$ for CIFAR-10 and $T = 20$ for MNIST and Fashion-MNIST. The optimizer is SGD with learning rate $\eta = 0.01$, momentum 0.9, and weight decay $5 \times 10^{-4}$. The batch size is 32.
Pruning strategy configuration. The base pruning rate is p base = 0.2 , serving as the initial conservative pruning rate, while the maximum pruning rate is p max = 0.6 , serving as the target pruning rate in later stages. The pruning rate growth factor β = 0.1 controls the gradual increase in the pruning rate, and the layer importance evaluation window is W = 5 rounds for computing the moving average. The mask consensus voting threshold is set to τ = 0.3 by default (sensitivity analysis reported in Section 4.2.5). The regrowth interval is R = 5 rounds, meaning the server performs mask regrowth every 5 communication rounds. The regrowth ratio is ζ = 0.05 , reactivating the top 5% of pruned parameters ranked by global importance at each regrowth step. The EMA smoothing factor for importance scores is ρ = 0.9 .
Experimental environment. All experiments are conducted on a server cluster equipped with NVIDIA RTX 3090 GPUs (24 GB memory), Intel Xeon Gold 6226R CPUs, and 128 GB RAM. Each experiment is independently repeated 3 times with different random seeds. We report the mean and standard deviation, and employ the t-test (significance level α = 0.05 ) to verify statistical significance.
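For convenience, the hyperparameters listed above can be collected into a single configuration object, as in the illustrative sketch below; the structure and key names are assumptions, not taken from a released codebase.

```python
# Hyperparameters from Section 4.1.4, gathered into one configuration dict.
CONFIG = {
    # Federated learning setup
    "num_clients": 10, "participation": 1.0, "local_epochs": 3,
    "rounds": {"cifar10": 30, "mnist": 20, "fashion_mnist": 20},
    "optimizer": {"lr": 0.01, "momentum": 0.9, "weight_decay": 5e-4, "batch_size": 32},
    # Pruning schedule and aggregation
    "p_base": 0.2, "p_max": 0.6, "beta_growth": 0.1, "importance_window": 5,
    "tau": 0.3, "regrow_interval": 5, "regrow_ratio": 0.05, "ema_rho": 0.9,
}
```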

4.2. Experimental Results and Analysis

4.2.1. Comprehensive Evaluation of Model Accuracy

Table 4 presents the final test accuracy achieved by different algorithms on three datasets, reflecting the convergence performance after 30 rounds (CIFAR-10) or 20 rounds (MNIST/Fashion-MNIST) of training.
The following key observations can be drawn from the experimental results:
Overall Performance: FedLayerPrune demonstrates superior performance among all pruning-based methods across all test scenarios. On the CIFAR-10 dataset, under moderate non-IID conditions ( α = 0.5 ), FedLayerPrune achieves an accuracy of 91.5%, only 0.6 percentage points lower than the uncompressed FedAvg baseline (92.1%). This marginal performance loss is exchanged for approximately 68.3% communication cost savings, demonstrating high compression efficiency. On the MNIST dataset, FedLayerPrune nearly matches FedAvg under certain configurations (98.4% vs. 98.7% at α = 1.0 ), which can be attributed to the regularization effect induced by pruning.
Comparison with FedDST: FedDST, as a strong dynamic sparse training baseline, achieves competitive accuracy (e.g., 90.8% on CIFAR-10 at α = 0.5 ). However, FedLayerPrune consistently outperforms FedDST by 0.4–1.2 percentage points across all settings. This improvement is attributed to the layer-adaptive pruning mechanism, which allocates pruning budgets according to layer sensitivity rather than applying uniform sparsification across all layers. The advantage is particularly pronounced under severe non-IID conditions ( α = 0.1 ), where FedLayerPrune achieves 88.7% compared to FedDST’s 87.5% on CIFAR-10.
Comparison with FedProx: FedProx, which addresses client heterogeneity through proximal regularization without pruning, achieves accuracy comparable to FedAvg (e.g., 89.5% vs. 89.2% at α = 0.1 on CIFAR-10). While FedProx slightly outperforms FedAvg under strong non-IID conditions due to its drift mitigation mechanism, it does not reduce communication costs. FedLayerPrune achieves similar non-IID robustness to FedProx while simultaneously providing 68% communication reduction.
Non-IID Robustness Analysis: As the degree of data heterogeneity increases ( α decreases from 1.0 to 0.1), all algorithms exhibit varying degrees of performance degradation. However, FedLayerPrune demonstrates the strongest robustness among pruning methods: under severe non-IID conditions ( α = 0.1 ), compared to FedAvg, FedLayerPrune incurs only an additional 0.4–0.5 percentage points loss, while FixedPrune-70 suffers up to 7.9 percentage points loss. This advantage primarily stems from our heterogeneity-aware aggregation mechanism.
Comparison with Fixed Pruning Strategies: FixedPrune-50 maintains relatively controllable performance loss while achieving moderate compression. However, at 70% pruning rate, performance degrades sharply, particularly on CIFAR-10 where accuracy drops by more than 5 percentage points. This validates our core observation: different network layers exhibit significantly different sensitivities to pruning, and a uniform high pruning rate severely impairs the representational capacity of critical layers.

4.2.2. Convergence Analysis and Training Dynamics

Figure 2 illustrates the evolution of test accuracy during training for various algorithms on the CIFAR-10 dataset ( α = 0.5 ).
The convergence curves reveal several important training dynamics:
Early Training Phase (Rounds 1–10): All methods start training from an initial accuracy of approximately 50%. FedLayerPrune adopts a progressive pruning strategy, achieving model compression while maintaining a fast convergence speed similar to FedAvg. By round 10, FedLayerPrune reaches an accuracy of approximately 83%, whereas FixedPrune-70, due to applying a high pruning rate of 70% from the beginning, only achieves approximately 71% accuracy, with a notably slower convergence speed.
Mid-term Stabilization Phase (Rounds 11–20): FedLayerPrune continues to maintain steady performance improvement, increasing from 83% to approximately 89%, gradually approaching the performance level of FedAvg. FedPrune also performs well during this phase, but remains slightly lower than FedLayerPrune. The performance of FixedPrune-50 falls between the two, reaching approximately 84% accuracy.
Late Optimization Phase (Rounds 21–30): In the later stages of training, FedLayerPrune ultimately converges to 91.5%, demonstrating sustained optimization capability. Notably, the convergence curve of FedLayerPrune (green dashed line) runs nearly parallel to that of FedAvg (blue solid line), maintaining a stable gap of only approximately 0.6 percentage points, while FixedPrune-70 consistently fails to break through the 82% performance bottleneck.

4.2.3. Communication Cost Analysis

Communication overhead is a key bottleneck in the practical deployment of federated learning. Table 3 provides a detailed comparison of communication costs across different methods throughout the entire training process. The communication volumes are computed using the precise formula from Section 3.4: each non-zero parameter is encoded in float32 (4 bytes), and masks are encoded using 1 bit per parameter (bitmap encoding). Both uplink (client → server) and downlink (server → client) transmissions are accounted for.
Detailed communication breakdown. For the CIFAR-10 setting with ResNet-18 (11.2 M parameters, $d = 11.2 \times 10^6$), FedAvg transmits $2 \times 4d = 89.6$ MB per round (uplink + downlink, full model), totaling $89.6 \times 30 \times 10 / 1024 \approx 20.16$ GB over 30 rounds with 10 clients. For FedLayerPrune with an average sparsity of 52.7%, the parameter payload per direction is $4 \times (1 - 0.527) \times d \approx 21.2$ MB, plus the bitmap mask overhead of $d/8 \approx 1.4$ MB, yielding approximately $2 \times (21.2 + 1.4) = 45.2$ MB per client per round. The mask overhead constitutes only ∼6.2% of the per-direction transmission, confirming that the bitmap encoding cost is modest relative to the parameter payload.
The communication efficiency analysis yields the following key insights:
CIFAR-10 (ResNet-18) Scenario: FedLayerPrune achieves the best communication-accuracy trade-off with a total transmission volume of only 6.45 GB, saving 68.3% compared to FedAvg’s 20.16 GB. Although the average sparsity of 52.7% is lower than FixedPrune-70’s fixed 70% sparsity, FedLayerPrune achieves higher accuracy (91.5% vs. 85.2% at α = 0.5 ) through intelligent layer-wise allocation. Compared to FedDST (64.0% reduction, 90.8% accuracy), FedLayerPrune achieves 4.3% more communication savings while maintaining 0.7 percentage points higher accuracy, demonstrating the advantage of layer-adaptive pruning over uniform dynamic sparsification.
MNIST and Fashion-MNIST (CNN) Scenarios: On structurally simpler CNN models (1.2 M parameters), FedLayerPrune demonstrates stable and superior performance. The communication savings rates for both datasets are nearly identical (MNIST: 67.8%, Fashion-MNIST: 67.6%), with average sparsity maintained at approximately 51.5–51.9%. This high consistency indicates that FedLayerPrune can automatically adapt to different network scales and data complexities without task-specific tuning.
It is worth noting that although the CNN model has only about one-tenth of the parameters of ResNet-18, FedLayerPrune still achieves a compression rate comparable to that of the larger model (∼68%), while avoiding the performance collapse caused by fixed high pruning rates on small networks. This adaptive capability demonstrates the effectiveness of the layer-aware mechanism.

4.2.4. In-Depth Analysis of Layer-Wise Pruning Rate Distribution

To gain a deeper understanding of the working mechanism of FedLayerPrune, Figure 3 presents the final pruning rate distribution across different layers of ResNet-18 trained on the CIFAR-10 dataset, with a comparison to fixed pruning strategies.
The layer-wise pruning rate distribution exhibits a distinct structural pattern, validating our theoretical hypothesis: the layer-wise pruning rate distribution shown in Figure 3 demonstrates a clear increasing trend, quantitatively verifying the parameter redundancy differences across different network layers.
The pruning rate distribution can be divided into three characteristic regions. First, the input layer conv1 maintains the lowest pruning rate of 32.5%, followed by the layer1 group, where pruning rates progressively increase from 38.2% to 48.7%. This conservative pruning strategy aligns with the feature extraction mechanism of convolutional neural networks—shallow layers are responsible for extracting low-level visual features, and these features play a fundamental role in subsequent layer processing, thus requiring the retention of more parameter capacity.
The middle layers (layer2–layer3) exhibit pruning rates distributed in the range of 45.3–61.2%, showing a steady upward trend. This distribution pattern is consistent with the hierarchical representation theory of deep networks: as network depth increases, feature representations gradually become more abstract, and parameter redundancy correspondingly increases. Notably, pruning rate gradients also exist within each residual block, indicating that the algorithm can finely capture the importance differences between layers.
The deep network layers exhibit the highest pruning tolerance, with the layer4 group reaching 61.2–64.9%, and the fully connected layers fc1 and fc2 reaching 72.3% and 78.6%, respectively. The high pruning rates of fully connected layers reflect their inherent over-parameterization characteristics, which are consistent with existing research findings on fully connected layer redundancy. After multiple convolutional operations, features have been sufficiently encoded, and the classification layer only needs to retain key connections to maintain discriminative capability.
This hierarchical pruning rate distribution is not a preset result, but is adaptively formed through the algorithm’s layer importance evaluation mechanism. Compared with fixed pruning strategies that apply uniform processing, FedLayerPrune achieves differentiated compression based on layer characteristics, which is the key factor enabling it to achieve high compression rates while maintaining model performance. This result provides important empirical support for adaptive model compression in federated learning.

4.2.5. Systematic Analysis of Ablation Study

To deeply understand the contribution of each component in FedLayerPrune to overall performance and their interaction mechanisms, we conducted a comprehensive ablation study on the CIFAR-10 dataset. Table 5 presents the performance changes when progressively removing key algorithm components under the moderate non-IID scenario ( α = 0.5 ). This systematic decomposition helps reveal the intrinsic logic of the algorithm design and the independent value of each component.
The ablation study reveals the hierarchical nature of FedLayerPrune’s algorithm design and the synergistic relationships among its components. The complete version of FedLayerPrune achieves an optimal balance between performance and efficiency, maintaining 91.5 ± 0.2% accuracy while controlling communication cost at 6.45 GB. This baseline provides a reference for evaluating the contribution of each component.
As the core innovation of the algorithm, the layer-adaptive strategy has been fully validated in the ablation study for its importance. Upon removing this component, the model accuracy drops to 89.8 ± 0.3%, with a performance loss of 1.7 percentage points, while communication cost increases to 7.12 GB. This significant performance degradation confirms our theoretical hypothesis: there exist fundamental differences in parameter importance across different network layers, and uniform pruning strategies cannot fully exploit this structural characteristic. The layer-adaptive mechanism dynamically evaluates the contribution of each layer and adjusts pruning intensity accordingly, effectively preserving key feature extraction capabilities, which is crucial for maintaining model representational capacity under compressed states.
The role of the dynamic pruning rate adjustment mechanism is manifested in maintaining stability throughout the training process. After removing this component, although communication cost increases to 7.89 GB, more critically, accuracy drops by 1.3 percentage points to 90.2 ± 0.3%. This result demonstrates that the progressive pruning strategy during training is not merely an engineering optimization but an important guarantee for ensuring model convergence quality. The progressive strategy allows the model to fully learn data characteristics during early training while gradually increasing the compression rate in later training to optimize communication efficiency. This temporally balanced strategy reflects a deep understanding of federated learning training dynamics.
The heterogeneity-aware aggregation mechanism contributes 0.9 percentage points of accuracy improvement in moderate non-IID environments, increasing accuracy from 90.6 ± 0.3% to the complete version’s 91.5%. Although this improvement is relatively modest in magnitude, its importance should not be underestimated. This mechanism addresses data distribution differences among clients by employing weighted aggregation strategies to mitigate model bias caused by non-IID data. In more extreme data heterogeneity scenarios, the contribution of this component becomes more pronounced, reflecting the algorithm’s adaptive design for complex federated learning environments.
The most compelling evidence comes from the control experiment where all optimization components are completely removed. When the algorithm degenerates to a basic pruning strategy only, accuracy drops sharply to 88.1 ± 0.4%, with a cumulative performance loss of 3.4 percentage points, and communication cost also increases to 8.47 GB. Notably, this cumulative loss (3.4%) is less than the sum of individual component contributions (1.7 + 1.3 + 0.9 = 3.9%), suggesting that certain functional overlap and complementarity exist among components. This nonlinear interaction effect further confirms the design rationality of FedLayerPrune as an integrated system, where components achieve performance improvements beyond simple addition through collaborative work.
From a theoretical perspective, the ablation study results validate the effectiveness of our three core design principles. First, the layer-wise differentiated processing principle is proven through experiments to be key for performance preservation; second, the dynamic adaptation principle during training ensures learning stability and final performance; finally, the federated learning-specific optimization principle (heterogeneity awareness), although contributing relatively less, is indispensable, especially in real-world deployment scenarios.
Sensitivity analysis of threshold τ . The mask consensus voting threshold τ controls how aggressively the global mask selects retained parameters. Table 6 reports the sensitivity of FedLayerPrune to different τ values on CIFAR-10 ( α = 0.5 ).
When τ is too low (0.1), the global mask becomes overly permissive, retaining nearly all parameters that any single client considers important. This leads to reduced sparsity (48.1%) and higher communication cost (7.03 GB) while providing marginal accuracy benefits. As τ increases to 0.5, sparsity improves further, but accuracy begins to decline because the mask becomes overly restrictive, requiring strong consensus across clients and potentially discarding parameters important for minority data distributions. At τ = 0.7 , accuracy drops by 1.1 percentage points, indicating that excessive conservatism in mask selection harms model capacity. The default value τ = 0.3 achieves a favorable balance: it maintains high accuracy (91.5%) while achieving substantial communication reduction (68.3%), with moderate mask stability across rounds.

4.2.6. Investigation of Data Heterogeneity Effects

Non-IID data distribution is a fundamental challenge in federated learning. Figure 4 comprehensively illustrates the performance of different algorithms on the CIFAR-10 dataset as the Dirichlet parameter α varies.
The experimental results demonstrate the stable performance of FedLayerPrune under varying heterogeneity levels. In the mildly non-IID scenario ( α = 1.0 ), FedLayerPrune achieves 92.9% accuracy, differing from FedAvg (93.5%) by only 0.6 percentage points. As data heterogeneity increases, in the moderately non-IID ( α = 0.5 ) and highly non-IID ( α = 0.1 ) scenarios, the algorithm maintains accuracy of 91.5% and 88.7%, respectively, exhibiting excellent robustness.
The performance degradation analysis reveals important pattern characteristics. From α = 1.0 to α = 0.1 , FedLayerPrune’s accuracy drops by 4.2 percentage points (92.9%→88.7%), achieving a relative performance retention rate of 95.4%. In contrast, FixedPrune-70’s accuracy drops by 6.5 percentage points (87.8%→81.3%), with a relative performance retention rate of only 92.6%. This difference is particularly evident in highly non-IID scenarios: the accuracy gap between FedLayerPrune and FixedPrune-70 expands from 5.1 percentage points in mildly non-IID conditions to 7.4 percentage points, fully demonstrating the advantages of the layer-adaptive strategy under complex data distributions.
Notably, FedLayerPrune outperforms FedPrune across all heterogeneity levels, despite both employing dynamic pruning. This consistent performance advantage (approximately 1–2 percentage points) validates the necessity of layer-wise differentiated processing. Particularly in highly non-IID scenarios, the advantage of FedLayerPrune (88.7%) over FedPrune (87.1%) further expands, indicating the critical role of the layer-adaptive mechanism in addressing data heterogeneity challenges.
From a mechanistic perspective, the robustness of FedLayerPrune stems from the synergistic effects of three key design elements. First, layer-adaptive pruning preserves the model’s core feature extraction capabilities, avoiding excessive compression of critical layers. Second, the progressive pruning strategy provides sufficient adaptation time for the model, enhancing the stability of the learning process. Finally, heterogeneity-aware aggregation effectively mitigates the impact of data distribution bias by dynamically adjusting client weights. Experimental data demonstrate that the contribution of the heterogeneity-aware component increases as the α value decreases: contributing 0.3% at α = 1.0 , 0.9% at α = 0.5 , and 1.8% at α = 0.1 , validating the value of this mechanism in extreme heterogeneity scenarios.
Client-level fairness analysis. To provide deeper insight into per-client performance under non-IID conditions, Table 7 reports the client-level accuracy statistics on CIFAR-10 ( α = 0.1 , the most heterogeneous setting).
Several observations emerge from the client-level analysis. First, FedLayerPrune achieves a client accuracy standard deviation of 2.9%, which is lower than FedAvg (3.1%) and FedDST (3.6%), indicating more equitable performance across clients with heterogeneous data distributions. Second, the worst-client accuracy of FedLayerPrune (83.1%) is substantially higher than that of FixedPrune-70 (71.2%) and comparable to FedAvg (83.5%), confirming that layer-adaptive pruning does not disproportionately harm clients with minority data distributions. Third, while FedProx achieves the lowest variance (2.7%) due to its explicit drift mitigation, it incurs zero communication savings. FedLayerPrune achieves comparable fairness (2.9% vs. 2.7%) while reducing communication by 68%, demonstrating the complementary benefits of the heterogeneity-aware aggregation mechanism.

4.2.7. Scalability Analysis with Partial Participation

To evaluate the scalability of FedLayerPrune under more realistic federated learning conditions, we conduct additional experiments on CIFAR-10 ( α = 0.5 ) with K = 50 clients and partial participation rate C = 0.3 (15 clients selected per round). The number of communication rounds is increased to T = 50 to allow sufficient convergence with partial participation. Table 8 reports the results.
Under the more challenging partial participation setting ( K = 50 , C = 0.3 ), FedLayerPrune maintains robust performance: the accuracy (90.1%) drops by only 1.4 percentage points compared to the full-participation setting (91.5% at K = 10 ), while still achieving 68.0% communication reduction relative to FedAvg. The convergence speed of FedLayerPrune (reaching 85% accuracy in 14 rounds) is close to FedAvg (12 rounds) and substantially faster than FixedPrune-70 (28 rounds), indicating that the progressive pruning strategy adapts well to the increased variance introduced by client subsampling.
Compared to FedDST under the same partial participation conditions, FedLayerPrune achieves 0.9 percentage points higher accuracy while using 5.0% less total communication, confirming that the advantage of layer-adaptive pruning is maintained when scaling to larger client populations. We note that FedProx achieves slightly higher accuracy (91.1%) in this setting due to its explicit proximal regularization, but at the cost of zero communication savings.
We acknowledge that experiments at an even larger scale ( K 100 ) would further strengthen these findings. However, the consistent performance trends observed from K = 10 to K = 50 —where FedLayerPrune maintains both its accuracy advantage and communication efficiency relative to baselines—suggest that the method scales gracefully with increasing client populations. We leave experiments on larger-scale cross-device settings and more complex datasets (e.g., CIFAR-100, Tiny-ImageNet) to future work.

5. Discussion

5.1. Analysis of Method Advantages

The effectiveness of FedLayerPrune is primarily attributed to the following key design elements:
Necessity of layer-wise awareness: Experimental results demonstrate that different layers in neural networks exhibit significant variations in their contribution to feature representation. The layer-wise pruning distribution in Figure 3 shows that fully connected layers tolerate up to 78.6% pruning, while shallow convolutional layers require retention rates above 60%. FedLayerPrune exploits this structural property by preserving low-level feature extraction modules while aggressively compressing redundant parameters in higher layers, achieving 68.3% communication savings with only 0.6% accuracy loss.
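To make the allocation concrete, the sketch below derives per-layer pruning rates from a sensitivity coefficient that depends on layer type and depth. The specific sensitivity values and the interpolation between the base and maximum rates are illustrative assumptions; the exact formulation follows the layer sensitivity coefficients α_l defined in the method section, not this code.

```python
# Illustrative sketch (not the paper's exact formula): allocate per-layer pruning
# rates from a sensitivity coefficient alpha_l that depends on layer type and depth,
# interpolating between a base and a maximum pruning rate.

def layer_pruning_rates(layers, p_base=0.3, p_max=0.8):
    """layers: list of (name, layer_type, depth_frac) with depth_frac in [0, 1]."""
    rates = {}
    for name, layer_type, depth_frac in layers:
        # Assumed sensitivity: shallow convolutions are most sensitive (alpha near 1),
        # fully connected layers least sensitive (small alpha).
        alpha = 1.0 - 0.5 * depth_frac if layer_type == "conv" else 0.3
        p_l = p_base + (p_max - p_base) * (1.0 - alpha)  # lower sensitivity -> heavier pruning
        rates[name] = round(min(p_l, p_max), 3)
    return rates

# Toy ResNet-style example: the shallow conv keeps most weights, the FC layer is pruned hardest.
print(layer_pruning_rates([("conv1", "conv", 0.0), ("layer3.conv2", "conv", 0.6), ("fc", "fc", 1.0)]))
```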
Value of dynamic pruning strategy: Maintaining a lower pruning rate during the early stages of training facilitates stable learning, while gradually increasing the pruning intensity in later stages effectively reduces communication overhead. The convergence curves in Figure 2 demonstrate that FedLayerPrune’s progressive strategy closely tracks FedAvg’s convergence trajectory, whereas static high pruning (FixedPrune-70) leads to permanently slower convergence.
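A minimal sketch of such a progressive schedule is shown below; the warm-up length and the cubic ramp are illustrative choices rather than the exact form of Equation (7).

```python
# Sketch of a progressive pruning schedule: hold a low base rate during early rounds,
# then ramp the rate towards its maximum as training stabilizes.
# The warm-up fraction and cubic ramp are assumptions, not Equation (7) itself.

def scheduled_pruning_rate(t, T, p_base=0.3, p_max=0.8, warmup_frac=0.2):
    """t: current round (0-indexed); T: total number of communication rounds."""
    warmup_rounds = int(warmup_frac * T)
    if t < warmup_rounds:
        return p_base                                               # gentle pruning while gradients are large
    beta_t = (t - warmup_rounds) / max(T - warmup_rounds - 1, 1)    # scheduling factor in [0, 1]
    return p_base + (p_max - p_base) * beta_t ** 3                  # compression intensifies late in training

print([round(scheduled_pruning_rate(t, 30), 2) for t in range(0, 30, 5)])
```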
Effectiveness of heterogeneity handling: The heterogeneity-aware aggregation mechanism, combining sample-size weighting with mask consensus voting, provides complementary benefits: the weighting addresses data volume imbalance, while the mask voting builds consensus on which parameters are structurally important across clients. The client-level analysis (Table 7) shows that this mechanism achieves fairness comparable to FedProx (standard deviation 2.9% vs. 2.7%) while simultaneously providing communication reduction.
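The following sketch illustrates one way to combine the two ingredients. The per-entry normalization of overlapping masks and the exact voting rule are our illustrative reading of the mechanism, with τ and n_k as defined in Table 2; this is not the authors’ released implementation.

```python
import numpy as np

# Sketch of heterogeneity-aware aggregation: sample-size weighted averaging of the
# sparse client updates, followed by mask consensus voting with threshold tau.

def aggregate(client_weights, client_masks, client_sizes, tau=0.3):
    """client_weights / client_masks: lists of equally shaped numpy arrays; client_sizes: n_k."""
    p = np.asarray(client_sizes, dtype=float)
    p = p / p.sum()                                                    # sample-size weights

    num = sum(pk * w * m for pk, w, m in zip(p, client_weights, client_masks))
    den = sum(pk * m for pk, m in zip(p, client_masks))                # weighted keep-frequency per entry
    global_w = np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)    # avoid biasing pruned entries toward zero

    # Mask consensus voting: retain an entry only if enough (weighted) clients kept it.
    global_mask = (den >= tau).astype(np.float32)
    return global_w * global_mask, global_mask

# Tiny usage example with three clients and a 4-parameter model.
w = [np.array([1.0, 0.0, 2.0, 0.5]), np.array([0.8, 1.0, 0.0, 0.4]), np.array([1.2, 0.0, 0.0, 0.6])]
m = [np.array([1, 0, 1, 1.0]), np.array([1, 1, 0, 1.0]), np.array([1, 0, 0, 1.0])]
print(aggregate(w, m, client_sizes=[100, 50, 50]))
```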

5.2. Informal Convergence Analysis

While a complete formal convergence proof is beyond the scope of this work, we provide an informal analysis to build intuition about the convergence behavior of FedLayerPrune. Under standard assumptions commonly adopted in the FL convergence literature [1,14], namely L-Lipschitz smooth local objectives, bounded gradient variance σ², and bounded gradient dissimilarity δ², the convergence of FedAvg-type algorithms typically takes the form
\[
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\,\big\|\nabla F(w_t)\big\|^2 \;\le\; O\!\left(\frac{1}{T}\right) + O\!\left(\frac{\sigma^2}{K}\right) + O\!\left(\delta^2\right).
\]
In FedLayerPrune, pruning introduces an additional perturbation: the transmitted sparse model w ⊙ m differs from the full model w by a masking error ‖w − w ⊙ m‖. The importance-based pruning strategy (Equation (9)) ensures that this error is minimized by removing parameters with the smallest Fisher information scores, which correspond to directions with minimal impact on the loss landscape. As a result, the effective convergence bound can be informally written as
\[
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\,\big\|\nabla F(w_t)\big\|^2 \;\le\; O\!\left(\frac{1}{T}\right) + O\!\left(\frac{\sigma^2}{K}\right) + O\!\left(\delta^2\right) + \underbrace{O\!\left(\epsilon_{\mathrm{prune}}\right)}_{\text{pruning bias}},
\]
where ϵ_prune depends on the pruning rate and the distribution of parameter importance scores. The progressive pruning schedule (Equation (7)) ensures that ϵ_prune is small during early training (when gradients are large) and increases only after the model has largely converged, thereby limiting the impact on final convergence quality.
The mask consensus voting mechanism (Equation (12)) introduces an additional source of bias: parameters that are important for some clients but not others may be pruned. However, the regrowth mechanism (every R rounds) provides a correction pathway that periodically re-evaluates and reactivates pruned parameters based on global importance, preventing premature loss of critical connections. Our experimental results (Section 4.2.2) confirm that FedLayerPrune’s convergence trajectory closely parallels that of FedAvg, with the gap narrowing in later rounds as the model stabilizes. A rigorous convergence analysis incorporating the interplay between dynamic masks, layer-adaptive rates, and heterogeneous aggregation is an important direction for future theoretical work.

5.3. Computational Overhead and Efficiency Analysis

Although FedLayerPrune introduces additional parameter importance evaluation and layer-wise pruning steps, the overall computational overhead remains manageable. The Fisher information approximation (Equation (9)) reuses gradient computations from backpropagation, adding approximately one extra mini-batch forward-backward pass per round ( O ( d ) ). The layer-adaptive pruning itself involves sorting and thresholding operations with complexity O ( d log d ) , which is negligible compared to the cost of local SGD training.
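As a concrete illustration of the cost involved, the sketch below computes a diagonal-Fisher importance score from one extra mini-batch and derives a mask by thresholding. The EMA smoothing factor ρ follows Table 2, while the function names and accumulation details are assumptions rather than the paper’s code.

```python
import torch

# Sketch of a diagonal-Fisher importance estimate and mask construction in the spirit
# of Equation (9): squared gradients from one extra mini-batch, smoothed with the EMA
# factor rho from Table 2. Illustrative only.

def fisher_importance(model, loss_fn, batch, prev_scores=None, rho=0.9):
    x, y = batch
    model.zero_grad()
    loss_fn(model(x), y).backward()                        # one extra forward-backward pass, O(d)
    scores = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        g2 = param.grad.detach() ** 2                      # diagonal Fisher approximation
        if prev_scores is not None and name in prev_scores:
            g2 = rho * prev_scores[name] + (1.0 - rho) * g2   # EMA smoothing across rounds
        scores[name] = g2
    return scores

def pruning_mask(score, pruning_rate):
    """Keep the (1 - pruning_rate) fraction of entries with the largest importance."""
    k = int(pruning_rate * score.numel())
    if k == 0:
        return torch.ones_like(score)
    threshold = torch.kthvalue(score.flatten(), k).values  # selection over d scores
    return (score > threshold).float()
```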
More critically, the significant reduction in communication cost far outweighs the additional computational consumption. In experiments, FedLayerPrune reduced communication volume by approximately 68% compared to FedAvg. In bandwidth-constrained environments where communication latency dominates wall-clock time, this reduction can lead to substantial end-to-end training speedups.

5.4. Practical Deployment and Application Potential

FedLayerPrune is designed with practical deployment requirements in mind. It requires no modification to the underlying network architecture and can be embedded into mainstream FL frameworks as a plug-in module. The pruning rate and layer sensitivity coefficients can be adjusted according to resource constraints and task requirements. Furthermore, FedLayerPrune is compatible with secure aggregation, differential privacy, and other privacy-preserving techniques, as the sparse model and binary mask can be processed through standard cryptographic protocols without modification.
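To illustrate why the mask adds only about d/8 bytes of overhead (Table 2) and why the payload remains easy to handle in standard aggregation pipelines, the sketch below packs a binary mask into a bitmap and transmits only the surviving float32 weights. The encode/decode function names are hypothetical, not part of any FL framework.

```python
import numpy as np

# Sketch of the sparse payload encoding implied by Table 2: the binary mask is packed
# into a bitmap (d/8 bytes) and only surviving float32 weights are sent, so a flattened
# update of dimension d costs roughly 4 * nnz + d/8 bytes.

def encode_update(weights, mask):
    bitmap = np.packbits(mask.astype(np.uint8))            # d/8 bytes for the mask
    values = weights[mask.astype(bool)].astype(np.float32)  # surviving weights only
    return bitmap, values

def decode_update(bitmap, values, d):
    mask = np.unpackbits(bitmap, count=d).astype(bool)
    full = np.zeros(d, dtype=np.float32)
    full[mask] = values
    return full, mask

# Round-trip check on a toy update with ~50% sparsity.
d = 1000
w = np.random.randn(d).astype(np.float32)
m = np.random.rand(d) > 0.5
bitmap, vals = encode_update(w, m)
restored, _ = decode_update(bitmap, vals, d)
assert np.allclose(restored[m], w[m]) and bitmap.nbytes == d // 8
```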
These characteristics indicate that FedLayerPrune has high application potential in scenarios such as mobile terminals and edge nodes of the Internet of Things (IoT), where both communication bandwidth and device heterogeneity are primary concerns.

5.5. Limitations and Future Research Directions

We acknowledge several limitations of the current work and identify directions for future research:
  • Theoretical convergence guarantees: While our informal analysis (Section 5.2) provides intuition, a rigorous convergence proof incorporating dynamic masks, layer-adaptive rates, and heterogeneous aggregation remains open. Establishing a formal trade-off bound that expresses the achievable accuracy as a function of the communication budget would significantly strengthen the theoretical contribution.
  • Larger-scale evaluation: Although our scalability experiments ( K = 50 , C = 0.3 ) demonstrate promising results, evaluation on larger client populations ( K 100 ) and more complex datasets (CIFAR-100, Tiny-ImageNet) would better reflect cross-device FL scenarios. System-level latency measurements incorporating actual network delays would also strengthen the practical assessment.
  • Adaptive layer sensitivity learning: The current layer sensitivity coefficients α l are manually assigned based on layer type and depth. Future work could explore data-driven approaches, such as meta-learning or neural architecture search, to automatically learn optimal per-layer pruning configurations.
  • Client-specific pruning strategies: Currently, all clients adopt the same pruning strategy determined by the global model structure. Personalizing pruning rates based on individual client constraints (computing power, bandwidth, and data volume) could further improve the overall system efficiency.
  • Hardware-aware structured pruning: The current hybrid pruning strategy produces semi-structured sparsity patterns. Incorporating hardware-aware constraints to produce fully structured sparsity (e.g., channel pruning) would enable direct inference acceleration on general-purpose hardware without sparse computation libraries.

6. Conclusions

This paper proposed FedLayerPrune, a communication-efficient federated learning method that integrates layer-adaptive pruning, dynamic pruning rate scheduling, and heterogeneity-aware aggregation into a unified framework. Unlike existing approaches that apply uniform pruning or address individual aspects in isolation, FedLayerPrune achieves a principled coordination among these components: our ablation study shows that removing any single component degrades accuracy by 0.9–1.7%, and removing all three degrades it by 3.4%, confirming that each component contributes materially to the overall performance.
Extensive experiments on CIFAR-10, MNIST, and Fashion-MNIST across varying levels of data heterogeneity demonstrate that FedLayerPrune reduces communication costs by approximately 68% compared with FedAvg while keeping accuracy loss within 2%. FedLayerPrune consistently outperforms both static pruning baselines and the dynamic sparse training method FedDST, achieving higher accuracy with greater communication savings. The client-level fairness analysis confirms that FedLayerPrune maintains equitable performance across clients under severe non-IID conditions, with accuracy variance comparable to the dedicated heterogeneity method FedProx. Scalability experiments with K = 50 clients and C = 0.3 partial participation validate that these advantages persist under more realistic federated settings.
We acknowledge that the current work has limitations, including the lack of formal convergence guarantees, manually specified layer sensitivity coefficients, and limited evaluation at a very large scale. Future work will focus on establishing rigorous convergence bounds for the proposed algorithm, exploring data-driven adaptive sensitivity learning, and extending evaluation to larger-scale cross-device FL scenarios with more complex datasets and system-level latency measurements. Overall, FedLayerPrune provides a practical and effective solution for deploying high-performance federated learning models in resource-constrained edge computing and IoT environments.

Author Contributions

Methodology, W.H.; software, W.H.; validation, W.H.; formal analysis, H.C.; investigation, D.Y.; writing—original draft preparation, W.H.; writing—review and editing, J.Z.; supervision, H.C.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Gansu Province Central Government Guided Local Science and Technology Development Fund (Grant No. 25ZYJA034), the National Natural Science Foundation of China (Grant No. 62566058), and the Lanzhou Science and Technology Plan Project (Grant No. 2025-2-42). The APC was funded by the Gansu Province Central Government Guided Local Science and Technology Development Fund (Grant No. 25ZYJA034), the National Natural Science Foundation of China (Grant No. 62566058), and the Lanzhou Science and Technology Plan Project (Grant No. 2025-2-42).

Data Availability Statement

The datasets used in this study are publicly available. The CIFAR-10 dataset can be obtained from http://www.cs.toronto.edu/~kriz/cifar.html (accessed on 10 July 2025). The MNIST dataset can be obtained from http://yann.lecun.com/exdb/mnist/ (accessed on 3 October 2025). The Fashion-MNIST dataset can be obtained from https://github.com/zalandoresearch/fashion-mnist (accessed on 15 December 2025).

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.Y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR: Cambridge, MA, USA, 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  2. Liu, B.; Lyu, N.; Guo, Y.; Xu, Y.; Zhu, S.; Wu, Z.; Shi, C.; Zhong, Y. Recent Advances on Federated Learning: A Systematic Survey. Neurocomputing 2024, 597, 128019. [Google Scholar] [CrossRef]
  3. Wen, J.; Zhang, Z.; Lan, Y.; Cui, Z.; Cai, J.; Zhang, W. A Survey on Federated Learning: Challenges and Applications. Int. J. Mach. Learn. Cybern. 2023, 14, 513–535. [Google Scholar] [CrossRef] [PubMed]
  4. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning Both Weights and Connections for Efficient Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; Curran Associates: Red Hook, NY, USA, 2015; Volume 28, pp. 1135–1143. [Google Scholar]
  5. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  6. Evci, U.; Gale, T.; Menick, J.; Castro, P.S.; Elsen, E. Rigging the Lottery: Making All Tickets Winners. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; PMLR: Cambridge, MA, USA, 2020; pp. 2943–2952. [Google Scholar]
  7. Bibikar, S.; Vikalo, H.; Wang, Z.; Chen, X. Federated Dynamic Sparse Training: Computing Less, Communicating Less, Yet Learning Better. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual, 22 February–1 March 2022; AAAI Press: Palo Alto, CA, USA, 2022; Volume 36, pp. 6080–6088. [Google Scholar]
  8. Jiang, Z.; Wang, Y.; Zhan, C.; Liu, J.; Huang, C. Computation and Communication Efficient Federated Learning With Adaptive Model Pruning. IEEE Trans. Mob. Comput. 2023, 22, 5765–5781. [Google Scholar] [CrossRef]
  9. Huang, H.; Zhuang, W.; Chen, C.; Lyu, L. FedMef: Towards Memory-efficient Federated Dynamic Pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 27548–27557. [Google Scholar]
  10. Li, A.; Sun, J.; Wang, B.; Duan, L.; Li, S.; Chen, Y.; Li, H. LotteryFL: Empower Edge Intelligence with Personalized and Communication-Efficient Federated Learning. In Proceedings of the IEEE/ACM Symposium on Edge Computing (SEC), San Jose, CA, USA, 14–17 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 68–79. [Google Scholar]
  11. Hoefler, T.; Alistarh, D.; Ben-Nun, T.; Dryden, N.; Peste, A. Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks. J. Mach. Learn. Res. 2021, 22, 1–124. [Google Scholar]
  12. Wang, Z.; Xu, Y.; Xu, J.; Yang, Y.; Zhou, X.; Zhang, J. Towards Efficient Federated Learning: Layer-Wise Pruning-Quantization Scheme and Coding Design. Entropy 2023, 25, 1205. [Google Scholar]
  13. Diao, E.; Ding, J.; Tarokh, V. HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  14. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Machine Learning and Systems (MLSys), Austin, TX, USA, 2–4 March 2020; MLSys: Indio, CA, USA, 2020; Volume 2, pp. 429–450. [Google Scholar]
  15. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; PMLR: Cambridge, MA, USA, 2020; Volume 119, pp. 5132–5143. [Google Scholar]
  16. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; Curran Associates: Red Hook, NY, USA, 2020; Volume 33, pp. 7611–7623. [Google Scholar]
  17. Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; Vojnovic, M. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Curran Associates: Red Hook, NY, USA, 2017; Volume 30, pp. 1707–1718. [Google Scholar]
  18. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  19. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  20. Chen, S.; Wang, H.; Zhang, Y.; Yang, L. Few-Shot Image Classification Algorithm Based on Global-Local Feature Fusion. Electronics 2025, 14, 456. [Google Scholar]
  21. Sun, J.; Chen, T.; Giannakis, G.B.; Yang, Q.; Yang, Z. Lazily Aggregated Quantized Gradient Innovation for Communication-Efficient Federated Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2031–2044. [Google Scholar] [CrossRef] [PubMed]
  22. Liu, H.; He, F.; Cao, G. Communication-Efficient Federated Learning for Heterogeneous Edge Devices Based on Adaptive Gradient Quantization. In Proceedings of the IEEE INFOCOM 2023, New York, NY, USA, 17–20 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–10. [Google Scholar]
  23. Mao, Y.; Zhao, Z.; Yan, G.; Liu, Y.; Lan, T.; Song, L.; Ding, W. Communication Efficient Federated Learning with Adaptive Quantization. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–26. [Google Scholar] [CrossRef]
  24. Zhao, Z.; Mao, Y.; Shi, Z.; Liu, Y.; Lan, T.; Ding, W.; Zhang, X.P. AQUILA: Communication Efficient Federated Learning with Adaptive Quantization in Device Selection Strategy. IEEE Trans. Mob. Comput. 2023, 23, 7363–7376. [Google Scholar] [CrossRef]
  25. Yue, K.; Jin, R.; Wong, C.W.; Dai, H. Communication-Efficient Federated Learning via Predictive Coding. IEEE J. Sel. Top. Signal Process. 2022, 16, 369–380. [Google Scholar] [CrossRef]
  26. Jiang, Y.; Wang, S.; Valls, V.; Ko, B.J.; Lee, W.H.; Leung, K.K.; Tassiulas, L. Model Pruning Enables Efficient Federated Learning on Edge Devices. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 10374–10386. [Google Scholar] [CrossRef] [PubMed]
  27. Babakniya, S.; Kundu, S.; Kundu, S.; Venkatesh, S.; Paiva, A.R.C.; Pal, S. FLASH: Concept Drift Adaptation via Federated Learning for Heterogeneous Edge Networks. In Proceedings of the IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), Hong Kong, China, 18–21 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 190–202. [Google Scholar]
  28. Alam, S.; Liu, L.; Yan, M.; Zhang, M. FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates: Red Hook, NY, USA, 2022; Volume 35, pp. 29677–29690. [Google Scholar]
  29. Gao, L.; Fu, H.; Li, L.; Chen, Y.; Xu, M.; Xu, C.Z. FedDC: Federated Learning with Non-IID Data via Local Drift Decoupling and Correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10102–10111. [Google Scholar]
  30. Zhang, Z.; Chen, Y.; Wang, X.; Li, B. Non-IID Data in Federated Learning: A Systematic Review with Taxonomy, Metrics, Methods, Frameworks and Future Directions. arXiv 2024, arXiv:2411.12377. [Google Scholar] [CrossRef]
  31. Li, X.; Jiang, M.; Zhang, X.; Kamp, M.; Dou, Q. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  32. Ma, X.; Zhang, J.; Guo, S.; Xu, W. Layer-wised Model Aggregation for Personalized Federated Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10092–10101. [Google Scholar]
  33. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009. [Google Scholar]
  34. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar] [CrossRef]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  36. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
Figure 1. Workflow of the FedLayerPrune algorithm.
Figure 2. Comparison of convergence curves on the CIFAR-10 dataset.
Figure 3. Pruning rate distribution of each layer in ResNet-18.
Figure 4. Performance comparison analysis under different non-IID levels (CIFAR-10).
Table 1. Comparison of FedLayerPrune with representative federated pruning methods.
Method | Layer-Adaptive Pruning | Dynamic Scheduling | Heterogeneity-Aware Aggreg. | Mask Regrowth | Pruning Type
PruneFL [26]×××Unstructured
HeteroFL [13]××Partial×Structured
FedDST [7]××Unstructured
FedMP [8]Partial××Hybrid
LotteryFL [10]×××Unstructured
FedRolex [28]×Partial×Structured
FedLP-Q [12]×××Structured
FedLayerPrune (Ours)Hybrid
Table 2. Notations and their corresponding definitions.
Symbol | Definition
K | Number of clients
D_k, n_k | Dataset of client k and its size
w, d | Model parameters and their dimension
w^(l), d_l | Parameters and dimension of layer l
r_l, p_l | Layer-wise retention and pruning rates (p_l = 1 − r_l)
m^(l) | Binary mask of layer l
I_i | Parameter importance score (diagonal Fisher information, Equation (9))
α_l | Layer sensitivity coefficient (type- and depth-dependent)
β_t | Temporal scheduling factor (Equation (7))
p_base, p_max | Base and maximum pruning rates
τ | Mask consensus voting threshold (default 0.3)
R | Regrowth interval in rounds (default 5)
ζ | Regrowth ratio per cycle (default 0.05)
ρ | EMA smoothing factor for importance scores (Equation (10))
H(m) | Mask encoding cost (bitmap: d/8 bytes)
Table 3. Comprehensive analysis of communication costs. Total communication includes both uplink and downlink over all rounds. Each dataset group reports Total Comm. / Reduction vs. FedAvg / Avg. Sparsity.
Method | CIFAR-10 (ResNet-18), GB | MNIST (CNN), MB | Fashion-MNIST (CNN), MB
FedAvg | 20.16 / – / 0% | 576.0 / – / 0% | 576.0 / – / 0%
FedProx | 20.16 / 0% / 0% | 576.0 / 0% / 0% | 576.0 / 0% / 0%
FixedPrune-50 | 10.08 / 50.0% / 50.0% | 288.0 / 50.0% / 50.0% | 288.0 / 50.0% / 50.0%
FixedPrune-70 | 6.05 / 70.0% / 70.0% | 172.8 / 70.0% / 70.0% | 172.8 / 70.0% / 70.0%
FedPrune | 8.47 / 58.0% / 42.3% | 242.9 / 57.8% / 41.8% | 244.1 / 57.6% / 41.5%
FedDST | 7.26 / 64.0% / 48.5% | 207.4 / 64.0% / 48.2% | 209.1 / 63.7% / 47.8%
FedLayerPrune | 6.45 / 68.3% / 52.7% | 185.2 / 67.8% / 51.9% | 186.8 / 67.6% / 51.5%
Table 4. Final test accuracy of different algorithms on various datasets (%, mean ± standard deviation). Each dataset group reports results for α = 0.1 / α = 0.5 / α = 1.0.
Method | CIFAR-10 | MNIST | Fashion-MNIST
FedAvg | 89.2 ± 0.3 / 92.1 ± 0.2 / 93.5 ± 0.2 | 96.8 ± 0.2 / 98.2 ± 0.1 / 98.7 ± 0.1 | 87.3 ± 0.4 / 89.8 ± 0.3 / 91.2 ± 0.2
FedProx | 89.5 ± 0.3 / 91.8 ± 0.2 / 93.2 ± 0.2 | 96.9 ± 0.2 / 98.0 ± 0.1 / 98.5 ± 0.1 | 87.6 ± 0.3 / 89.5 ± 0.3 / 90.9 ± 0.2
FixedPrune-50 | 85.6 ± 0.5 / 88.9 ± 0.4 / 90.7 ± 0.3 | 94.2 ± 0.3 / 96.5 ± 0.2 / 97.3 ± 0.2 | 84.1 ± 0.5 / 87.2 ± 0.4 / 88.9 ± 0.3
FixedPrune-70 | 81.3 ± 0.7 / 85.2 ± 0.6 / 87.8 ± 0.5 | 91.5 ± 0.5 / 94.1 ± 0.4 / 95.6 ± 0.3 | 80.2 ± 0.8 / 84.3 ± 0.6 / 86.7 ± 0.5
FedPrune | 87.1 ± 0.4 / 90.3 ± 0.3 / 91.8 ± 0.3 | 95.6 ± 0.3 / 97.3 ± 0.2 / 97.9 ± 0.1 | 85.7 ± 0.4 / 88.4 ± 0.3 / 90.1 ± 0.3
FedDST | 87.5 ± 0.4 / 90.8 ± 0.3 / 92.2 ± 0.3 | 95.9 ± 0.3 / 97.5 ± 0.2 / 98.1 ± 0.1 | 86.0 ± 0.4 / 88.7 ± 0.3 / 90.4 ± 0.2
FedLayerPrune | 88.7 ± 0.3 / 91.5 ± 0.2 / 92.9 ± 0.2 | 96.4 ± 0.2 / 97.8 ± 0.1 / 98.4 ± 0.1 | 86.8 ± 0.3 / 89.3 ± 0.2 / 90.8 ± 0.2
Table 5. Detailed results of ablation study (CIFAR-10, α = 0.5).
Method Variant | Layer-Adaptive | Dynamic Pruning | Heterogeneity-Aware | Accuracy (%) | Comm. Cost (GB) | Acc. Drop
FedLayerPrune (Full) | ✓ | ✓ | ✓ | 91.5 ± 0.2 | 6.45 | –
w/o Layer-Adaptive | × | ✓ | ✓ | 89.8 ± 0.3 | 7.12 | 1.7%
w/o Dynamic Adjustment | ✓ | × | ✓ | 90.2 ± 0.3 | 7.89 | 1.3%
w/o Heterogeneity-Aware | ✓ | ✓ | × | 90.6 ± 0.3 | 6.52 | 0.9%
Basic Pruning Only | × | × | × | 88.1 ± 0.4 | 8.47 | 3.4%
Table 6. Sensitivity analysis of mask consensus threshold τ (CIFAR-10, α = 0.5).
τ | Accuracy (%) | Comm. Cost (GB) | Avg. Sparsity | Mask Stability
0.1 | 91.2 ± 0.3 | 7.03 | 48.1% | Low
0.3 (default) | 91.5 ± 0.2 | 6.45 | 52.7% | Moderate
0.5 | 91.3 ± 0.2 | 6.12 | 55.3% | High
0.7 | 90.4 ± 0.4 | 5.68 | 59.8% | Very High
Table 7. Client-level accuracy statistics under severe non-IID (α = 0.1, CIFAR-10).
Method | Mean Acc. (%) | Std. Dev. | Worst Client (%) | Best Client (%)
FedAvg | 89.2 | 3.1 | 83.5 | 94.2
FedProx | 89.5 | 2.7 | 84.8 | 93.9
FixedPrune-70 | 81.3 | 5.8 | 71.2 | 89.1
FedDST | 87.5 | 3.6 | 81.3 | 92.8
FedLayerPrune | 88.7 | 2.9 | 83.1 | 93.5
Table 8. Scalability results on CIFAR-10 (α = 0.5) with K = 50, C = 0.3.
Method | Accuracy (%) | Total Comm. (GB) | Reduction | Rounds to 85%
FedAvg | 90.8 ± 0.4 | 50.40 | – | 12
FedProx | 91.1 ± 0.3 | 50.40 | 0% | 11
FixedPrune-70 | 83.1 ± 0.8 | 15.12 | 70.0% | 28
FedDST | 89.2 ± 0.5 | 18.65 | 63.0% | 16
FedLayerPrune | 90.1 ± 0.3 | 16.13 | 68.0% | 14