Article

A Supernet-Only Framework for Federated Learning in Computationally Heterogeneous Scenarios

1 School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
2 Key Laboratory of Parallel, Distributed and Intelligent Computing in Guangxi Universities and Colleges, Nanning 530004, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5666; https://doi.org/10.3390/app15105666
Submission received: 18 April 2025 / Revised: 14 May 2025 / Accepted: 14 May 2025 / Published: 19 May 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Federated learning effectively addresses data privacy and non-independent and identically distributed (non-IID) data in the Internet of Things, but it does not resolve device heterogeneity. Neural Architecture Search can alleviate this by constructing multiple model structures to optimize federated learning performance across diverse edge devices. However, existing methods, whether based on lightweight networks or client grouping, face a tradeoff between scaling to larger federations and utilizing more powerful structures. We decompose residual network blocks, reformulating them as a Neural Architecture Search task. Furthermore, we propose a method for reinterpreting any residual sequential architecture as a supernet and develop a training pipeline tailored to this reinterpreted architecture, mitigating this tradeoff. We conduct pretraining on ImageNet1K and federated training on the CIFAR-100, CIFAR-10, and CINIC-10 datasets under both the ring-based federated learning and FedAvg frameworks. In less constrained environments, our method performs comparably to the other top-two method, which varies across experimental settings, while maintaining a margin of at least 1% Top-1 accuracy over the third-best method. Under balanced settings, our method outperforms the second-best approach by more than 1%, and this advantage increases to over 5% as the task difficulty rises. Under the most challenging setting, our method outperformed AdaptiveFL, a state-of-the-art dynamic network method for federated learning, by 18.3% on CIFAR-100 with 100 clients under a ResNet backbone.

1. Introduction

The Internet of Things (IoT) [1] has become an integral part of daily life and industrial processes, generating vast amounts of distributed data from edge devices. However, leveraging these data presents significant challenges, as follows: privacy concerns arise from sensitive information such as smart home data, medical records, or industrial trade secrets; edge device data often deviate from the Independent and Identically Distributed (IID) assumption due to variations in user behavior or institutional demographics; and edge devices exhibit performance heterogeneity caused by differences in design and technological generations. To address these challenges, Federated Learning (FL) [2] has emerged as a promising approach. FL mitigates privacy risks by performing local model training on edge devices and aggregating updates on a central server, avoiding sensitive data transmission. Additionally, FL enhances usability by adapting to non-Independently and Identically Distributed (non-IID) data distributions through localized training.
Despite the potential of FL, its uniform network structures struggle to accommodate device heterogeneity, and lightweight techniques [3] such as pruning, sparsification, and quantization are therefore integrated to reduce model complexity and align with varying device capabilities. While these techniques partially alleviate the device heterogeneity problem, their compression capabilities remain limited, leaving this issue only partially addressed. Recent advancements have integrated Neural Architecture Search (NAS) [4] with FL to enhance machine learning services on heterogeneous edge devices, giving rise to Federated NAS (Fed-NAS) [5]. Common approaches include searching for networks of varying sizes or designing multi-scale networks through supernets. However, these methods primarily focus on designing networks for different scales of clients without ensuring shared features among all clients for collective benefits, achieving only partial FL objectives.
The problem is that using the largest model every client can afford and forming a federated cluster with all clients are difficult to achieve simultaneously. In contrast to prior interpretations of residual connections [6], we introduce a new structural perspective that enables reformulating residual networks as NAS-compatible supernets. We observe that a residual network can be represented as a weighted sum of the outputs of all its nonlinear transformations; making these weights learnable enables us to convert the residual network into a NAS task. By applying transformations to each stage separately, the size of each client's sub-network can be determined by a truncation parameter. As long as we keep the entire supernet intact, our model can freely switch between different sub-networks. Although our method introduces a small number of additional parameters corresponding to the number of residual units, these are used to dynamically normalize the structural weights during training. This enables consistent output distributions across sub-networks with different truncation lengths and effectively mitigates the distribution shift problem commonly observed in dynamic neural networks. We propose a novel reinterpretation of residual connections, showing that residual networks can be naturally viewed as continuous supernets, where each residual block contributes as a weighted component in a unified architecture space. This perspective transforms the role of residual connections from simple identity shortcuts into explicit architecture-controlling mechanisms, enabling a new class of NAS-compatible designs grounded in standard residual architectures. Unlike conventional NAS supernets, where architectural decisions select between competing candidate operations, our reinterpreted supernet only allows truncation from left to right within each stage. As a result, every sub-network becomes a deterministic prefix of the full residual sequence, avoiding operation-level conflicts and enabling full parameter sharing, a property that conventional NAS and Fed-NAS designs inherently lack due to their search-space structure.
To train this supernet, we propose the Single-Path Random Quit (SPRQ) method, in which different exit lengths are randomly selected at each iteration to simulate different network sizes for various clients. SPRQ not only enhances performance but also reduces pretraining time by 24%, significantly improving training efficiency in large-scale settings. Moreover, as SPRQ suffers from initialization difficulties, we suggest using the two-stage optimization approach proposed in DARTS [7] to perform a warm-up. Notably, our Supernet-Only framework is independent of the underlying FL algorithm and can be applied to most common FL frameworks.
We perform our pretraining scheme on ImageNet1K [8] and compare the performance of the Supernet-Only framework with that of using lightweight networks and clustering clients to build multiple federations, under identical client communication and computation constraints, on the CIFAR-100, CIFAR-10 [9], and CINIC-10 [10] datasets, as well as on ring-based FL and FedAvg. Our method achieves remarkable results across all datasets and federated learning methods, and its advantages become more pronounced as the generalization requirements of the dataset increase. We also validate the necessity of our proposed SPRQ method and the warm-up method in the pretraining methods. In more challenging scenarios, such as those involving CIFAR-100, large-scale federated settings, or smaller backbone networks, our method consistently achieves more than a 1% improvement in Top-1 accuracy, and it exceeds the second-best method by over 5% in all CIFAR-100 experiments under the FedAvg framework. In less constrained settings, including stronger backbones, smaller-scale federated configurations, and simpler datasets, our method performs on par with the best alternative method in each case, while maintaining a margin of more than 1% over the third-best method.
We summarize our contributions as follows. Firstly, we revisit the role of residual connections and propose a method to reformulate any network composed of residual units into a supernet for NAS tasks, effectively reinterpreting the optimization task of any residual network into a NAS task. Secondly, we propose the Single-Path Random Quit (SPRQ) pretraining method to simulate structural heterogeneity in federated learning, and we subsequently design a complete training pipeline based on it. Finally, we outline several common paths in Fed-NAS methods, validate the superiority of our proposed method across multiple datasets, and further compare it with AdaptiveFL [11], a state-of-the-art dynamic FL approach.
The remainder of this paper is organized as follows: Section 2 reviews the background and related work. Section 3 describes the motivation for our approach and details our methods, including the supernet framework and the training pipeline. Section 4 evaluates the proposed method through experiments, and Section 5 concludes our contributions and discusses future research directions.

2. Background and Related Work

This study focuses on the intersection of FL and NAS, addressing the challenge of lightweight learning and the integration of lightweight techniques, FL, and NAS.

2.1. Federated Learning

Federated Learning (FL) is a distributed machine learning paradigm that enables multiple devices to collaboratively train a global model without sharing raw data. A central server orchestrates the model, while clients train locally using their data and send updates—typically gradients or trained model parameters—to the server in a privacy-preserving manner. Assume there are $K$ clients (participants), and each client $k$ holds a local dataset $D_k$. The goal is to optimize the global model parameters $P$ by minimizing the global loss function $L$, defined as the weighted sum of the $K$ local losses $L_k$:
$$\min_{P} L(P) = \min_{P} \sum_{k=1}^{K} \frac{|D_k|}{\sum_{i=1}^{K} |D_i|}\, L_k(P) \tag{1}$$
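As a concrete illustration of this objective (not taken from the paper's implementation), the following sketch computes the dataset-size-weighted global loss of Equation (1) and the corresponding FedAvg-style weighted parameter average; the client dictionaries, `loss_fn` closures, and all numbers are hypothetical.

```python
import numpy as np

def global_loss(params, clients):
    """Dataset-size-weighted sum of local losses, as in Eq. (1)."""
    total = sum(c["num_samples"] for c in clients)
    return sum(c["num_samples"] / total * c["loss_fn"](params) for c in clients)

def weighted_average(client_params, client_sizes):
    """FedAvg-style aggregation: average client parameter vectors with weights |D_k| / sum_i |D_i|."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return (weights[:, None] * np.stack(client_params)).sum(axis=0)

# Hypothetical setup: three clients with quadratic local losses around different optima.
rng = np.random.default_rng(0)
optima = rng.normal(size=(3, 5))
clients = [
    {"num_samples": n, "loss_fn": (lambda p, c=c: float(np.sum((p - c) ** 2)))}
    for n, c in zip([100, 300, 600], optima)
]
print(global_loss(np.zeros(5), clients))
print(weighted_average([rng.normal(size=5) for _ in range(3)], [100, 300, 600]))
```

The weighting by $|D_k|$ ensures that clients with more data contribute proportionally more to the global objective.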
FL encompasses several fundamental approaches [2], such as ring-based aggregation FL [12], gradient-based FedSGD [13], and model-based aggregation FL [14]. FL can mitigate the issues of non-IID data and data privacy. However, it generally requires different clients to use the same network architecture, making it difficult to handle highly heterogeneous devices.
Some studies attempt to address the issue of device heterogeneity in federated learning by dynamically adjusting model complexity through pruning strategies. They send networks of different sizes to different clients, where networks of varying sizes share a portion of the channels [15], the early layers [16], or a combination of both [11,17]. Since it is necessary to train directly on networks of different sizes, they face various challenges such as optimal decision and generalization issues [18].

2.2. Neural Architecture Search

Neural Architecture Search (NAS) [4] is an automated approach for optimizing neural network structures, framed as a bi-level optimization problem. The upper-level objective minimizes the validation loss $L_{\mathrm{val}}$ with respect to the architecture hyperparameters $W$ at the optimal model weights $P^{*}(W)$, while the lower-level objective minimizes the training loss $L_{\mathrm{train}}$ with respect to the model weights $P$ given $W$:
$$\min_{W} L_{\mathrm{val}}(P^{*}(W), W) \quad \text{s.t.} \quad P^{*}(W) = \arg\min_{P} L_{\mathrm{train}}(P, W) \tag{2}$$
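The bi-level problem in Equation (2) is usually approximated by alternating gradient steps, as in gradient-based NAS. The sketch below is a minimal first-order version, assuming a PyTorch `model` whose weights and architecture parameters are registered with two separate optimizers; it is illustrative rather than a faithful DARTS implementation.

```python
import torch

def alternating_step(model, loss_fn, train_batch, val_batch, w_opt, a_opt):
    """One first-order alternating update for the bi-level problem of Eq. (2).

    w_opt optimizes only the model weights P; a_opt optimizes only the
    architecture parameters W (e.g., raw structural weights).
    """
    x_tr, y_tr = train_batch
    w_opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()    # lower level: L_train w.r.t. P
    w_opt.step()

    x_val, y_val = val_batch
    a_opt.zero_grad()
    loss_fn(model(x_val), y_val).backward()  # upper level: L_val w.r.t. W
    a_opt.step()
```

Because each optimizer is constructed over its own parameter group, the weight step on the training split and the architecture step on the validation split touch disjoint sets of parameters.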
NAS is extensively used to design efficient lightweight neural networks tailored for resource-constrained applications. Advances in NAS have led to the development of numerous lightweight architectures, such as Effnet [19] and MobileNet [20], for resource-limited devices. The gradient-based NAS framework [7,21] is among the most popular NAS frameworks, assigning parameters to candidate modules and alternately optimizing architecture and weights during a single training process. However, gradient-based NAS faces challenges such as excessive memory consumption and unreasonable context coupling, which can be effectively addressed through uniform random sampling, as in Single-Path One-Shot (SPOS) [22].

2.3. Federated Neural Architecture Search (Fed-NAS)

Fed-NAS [23] research primarily addresses two key challenges in terms of performance. The first focuses on developing high-performance neural networks from diverse client data while preserving privacy. The second emphasizes efficiently using NAS to tailor architectures for varying client capabilities under privacy and computational constraints.
Some studies focus on the first challenge, performing device-aware or offline NAS on high-performance private clients while employing lightweight methods to reduce client burdens during the search process [24]. Other studies explore approaches to the second challenge; however, the resulting sub-networks may become excessively downsized [25], or clients must be clustered to enhance performance within each federation [26,27]. Most studies fail to effectively address the issue of computational heterogeneity.
Recent research attempts to resolve these limitations by performing structure searches tailored to resource-constrained clients. For example, some methods optimize a unique sub-network for each client extracted from a supernet [28], or separate clients into those used for training and those used for inference [29]. Other approaches partition the model into shared and personalized units, aggregating only the shared units during the federated process [30,31]. However, these methods more or less face a trade-off between federation size and network size.

3. Method

We propose a method to adapt network scales to heterogeneous client devices while maintaining compatibility across varying computational capacities in federated learning. To address this challenge, we introduce the Supernet-Only framework, which comprises the following two components: (i) a reinterpretation of residual sequential networks as a supernet, and (ii) a dedicated training pipeline.

3.1. Reinterpret Approach

In this section, we introduce our motivations and our approach to reinterpreting a residual sequential network into a supernet of a NAS task.
Consider an arbitrary residual sequential network $N$ with parameters $P$. Each residual unit $N_i$ of $N$ can be decomposed into two parts, a nonlinear transformation $N_i^{\mathrm{Raw}}$ and an identity mapping $I$, as follows:
$$N_i = N_i^{\mathrm{Raw}} + I. \tag{3}$$
Let $N_l^{\mathrm{Pre}}$ represent the preceding net of the $l$-th layer, which is the combination of the first $l-1$ layers, and adopt $N_1^{\mathrm{Pre}} = N_0^{\mathrm{Pre}} = N_0^{\mathrm{Raw}} = I$. The full residual network $N$, composed of $L$ layers, can then be recursively reformulated as follows:
$$\begin{aligned}
N &= (N_L^{\mathrm{Raw}} + I) \circ N_L^{\mathrm{Pre}} \\
  &= N_L^{\mathrm{Raw}} \circ N_L^{\mathrm{Pre}} + N_L^{\mathrm{Pre}} \\
  &= N_L^{\mathrm{Raw}} \circ N_L^{\mathrm{Pre}} + (N_{L-1}^{\mathrm{Raw}} + I) \circ N_{L-1}^{\mathrm{Pre}} \\
  &= N_L^{\mathrm{Raw}} \circ N_L^{\mathrm{Pre}} + N_{L-1}^{\mathrm{Raw}} \circ N_{L-1}^{\mathrm{Pre}} + N_{L-1}^{\mathrm{Pre}} \\
  &\;\;\vdots \\
  &= N_L^{\mathrm{Raw}} \circ N_L^{\mathrm{Pre}} + N_{L-1}^{\mathrm{Raw}} \circ N_{L-1}^{\mathrm{Pre}} + \cdots + N_2^{\mathrm{Raw}} \circ N_2^{\mathrm{Pre}} + N_2^{\mathrm{Pre}} \\
  &= N_L^{\mathrm{Raw}} \circ N_L^{\mathrm{Pre}} + N_{L-1}^{\mathrm{Raw}} \circ N_{L-1}^{\mathrm{Pre}} + \cdots + N_2^{\mathrm{Raw}} \circ N_2^{\mathrm{Pre}} + N_1 \\
  &= N_L^{\mathrm{Raw}} \circ N_L^{\mathrm{Pre}} + N_{L-1}^{\mathrm{Raw}} \circ N_{L-1}^{\mathrm{Pre}} + \cdots + N_2^{\mathrm{Raw}} \circ N_2^{\mathrm{Pre}} + N_1^{\mathrm{Raw}} + I \\
  &= \sum_{l=0}^{L} N_l^{\mathrm{Raw}} \circ N_l^{\mathrm{Pre}}
\end{aligned} \tag{4}$$
Let $x_l$ denote the input to model layer $l$. The result of the nonlinear transformation is denoted by $y_l^{\mathrm{Raw}} = N_l^{\mathrm{Raw}}(x_l)$, and the output of layer $l$ is denoted as $y_l = N_l(x_l) = x_l + y_l^{\mathrm{Raw}}$. The input to this layer is both the output of the previous layer, $x_l = y_{l-1}$, and also the output of the preceding net of this layer, $x_l = N_l^{\mathrm{Pre}}(x)$. Then, the output of each term $N_l^{\mathrm{Raw}} \circ N_l^{\mathrm{Pre}}$ in Equation (4) can be represented as follows:
$$N_l^{\mathrm{Raw}}(N_l^{\mathrm{Pre}}(x)) = y_l^{\mathrm{Raw}} \tag{5}$$
Substituting Equation (5) into Equation (4), the output of any residual sequential network can be represented as the sum of the outputs of the nonlinear transformations of each residual unit as follows:
$$N(x) = \sum_{l=0}^{L} y_l^{\mathrm{Raw}} \tag{6}$$
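Equation (6) can be checked numerically: unrolling a chain of residual units and summing the intermediate nonlinear outputs (plus the input, which plays the role of the $l = 0$ term) reproduces the sequential output exactly. The toy PyTorch sketch below uses small random linear blocks purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L, dim = 4, 8
blocks = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(L)]  # N_l^Raw

x = torch.randn(3, dim)

# Standard sequential residual evaluation: x_{l+1} = x_l + N_l^Raw(x_l).
cur = x
outputs = [x]          # l = 0 term: the identity contribution of the input
for block in blocks:
    y = block(cur)     # y_l^Raw
    outputs.append(y)
    cur = cur + y

# Equation (6): the same result as the sum of all intermediate nonlinear outputs.
print(torch.allclose(cur, sum(outputs), atol=1e-6))  # True
```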
Unlike the conventional view that networks with residual connections are sequential architectures at coarser scales, Equation (6) instead forms a parallel architecture. In discussions surrounding residual networks, some studies [32,33] suggest that low-quality nonlinear transformations can degrade, thereby minimizing their impact on the learning process. Inspired by this, we realized that there exists a hidden weight that determines the contribution ratio of each nonlinear transformation. Defining these hidden parameters as weights, the network output can be expressed as a weighted sum of contributions from all layers, $N(x) = \sum_{l=0}^{L} W_l \cdot y_l^{\mathrm{Raw}}$, where $W_l\ (l = 0, 1, \ldots, L)$ are the weights satisfying non-negativity and normalization constraints. Furthermore, the model can be represented as follows:
$$N = \sum_{l=0}^{L} W_l \cdot N_l^{\mathrm{Raw}} \circ N_l^{\mathrm{Pre}} \tag{7}$$
Since the weights $W = \{W_0, W_1, \ldots, W_L\}$ have been separated from the original training task, the training of these hidden parameters can be considered a meta-learning task $\arg\min_{W} L_{\mathrm{val}}(P^{*}(W))$ beyond the original network parameter optimization $\arg\min_{P} L_{\mathrm{train}}(W^{*}(P))$.
To enforce the normalization and non-negativity constraints on the weights, we introduce the canonical link function of the Generalized Linear Model (GLM) with multinomial distribution logits as the bridge between the weights and the meta-learning parameters. Let $W^{\mathrm{Raw}}$ denote the parameters corresponding to the architecture weights. The final weights can then be computed as follows:
$$W = \mathrm{Softmax}(W^{\mathrm{Raw}}). \tag{8}$$
Figure 1a illustrates the computational structure of each stage in the reinterpreted supernet.
We observe that the meta-learning component of this two-step optimization process corresponds to neural architecture optimization, while the original task constitutes parameter optimization. This residual network formulation naturally aligns with the bi-level optimization structure commonly adopted in neural architecture search, as represented in Equation (2).
Although the model may seem to allow the arbitrary selection of intermediate outputs from nonlinear transformations, each output, in fact, depends sequentially on all preceding layers. Once a layer is dropped, all subsequent layers become uncomputable. As a result, sub-networks can only be formed through truncation, and the structure of each truncated stage in the sub-network is shown in Figure 1b.
In the preceding text, we have demonstrated that all architectures with residual connections can be reinterpreted as a NAS problem through a single parameterization step. However, when reinterpreting an arbitrary network with residual connections as a NAS problem, we observe that two issues still need to be addressed. Firstly, modern network architectures are typically composed of multiple stages, where each stage contains one downsampling module and several residual blocks, as in ConvNeXt [34] and the Swin Transformer [35]. Secondly, the omission of certain layers alters the distribution of the aggregated representations, which consequently leads to a shift in the distribution of the final softmax output $\mathrm{Softmax}(N(x))$.
Although most networks are not entirely composed of residual units, the primary feature extraction within each stage is carried out by a sequence of residual units. Therefore, the transformation described by Equation (7) can be applied to each stage individually, with truncation performed within each stage accordingly. Assume the input to the $s$-th stage, which contains $L_s$ residual units, is denoted as $x_s$ and the output as $y_s$. The result of a downsampling operation serves as the input $x_{s,1}$ to the first residual unit. The input to the $l$-th residual unit $N_{s,l}$ within this stage is represented by $x_{s,l} = N_{s,l}^{\mathrm{Pre}}(x_{s,1})$, where $N_{s,l}^{\mathrm{Pre}}$ denotes the preceding net of the $l$-th layer within stage $s$. Therefore, the feature extraction of this stage can be represented as follows:
$$y_s = \sum_{l=1}^{L_s} W_{s,l} \cdot N_{s,l}^{\mathrm{Raw}}(x_{s,l}; P_{s,l}) + W_{s,0} \cdot x_{s,1}, \tag{9}$$
where $W_{s,l}$ and $P_{s,l}$ denote the weight and the parameters of the $l$-th residual unit $N_{s,l}$ in stage $s$, respectively.
In manually designed networks such as ConvNeXt and Transformers, as well as in searched architectures like EfficientNet, the most common macro network structure typically consists of a preprocessing component, such as an Embedding layer or a Stem layer, followed by multiple stages. Each stage is composed of a downsampler followed by a feature extraction block, and the network ends with one or more detection heads. If each stage is rewritten in the form of Equation (9), while keeping all other components unchanged, the sequential network can be reinterpreted as a supernet for a NAS task.
For the second issue, we chose to truncate the weight parameters along with the network itself in order to reduce the magnitude of distribution shifts after dropping certain layers. Although the subsequent weights are discarded along with the truncated part of the network, the retained weights still satisfy normalization and non-negativity, since only the weight parameters involved in subsequent computations are mapped to the probability simplex by the softmax function. Suppose that, in a certain training or inference scenario, stage $s$ is truncated, retaining $l_s$ residual units. Even if the weight parameters are not updated, the $l_s + 1$ weights that are used are recomputed as follows:
$$W_{s,l} = \frac{e^{W_{s,l}^{\mathrm{Raw}}}}{\sum_{j=0}^{l_s} e^{W_{s,j}^{\mathrm{Raw}}}}, \quad \text{where } 0 \le l \le l_s. \tag{10}$$
The complete calculation of the truncated s-th stage is as follows:
$$\begin{aligned}
y_s &= \sum_{l=1}^{l_s} W_{s,l} \cdot N_{s,l}^{\mathrm{Raw}}(x_{s,l}; P_{s,l}) + W_{s,0} \cdot x_{s,1}, \\
\text{where} \quad x_{s,1} &= \mathrm{Downsampler}_s(x_s), \\
W_{s,l} &= \frac{e^{W_{s,l}^{\mathrm{Raw}}}}{\sum_{j=0}^{l_s} e^{W_{s,j}^{\mathrm{Raw}}}}, \quad 0 \le l \le l_s.
\end{aligned} \tag{11}$$
This process is illustrated in Figure 2, where up to $l_s$ residual units are retained in each stage.
By reinterpreting each stage in this way, while keeping other modules (e.g., Stem, Head) unchanged, any modern multi-stage residual network can be reformulated as a supernet suitable for NAS.
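To make the prefix renormalization of Equation (10) concrete, the short NumPy sketch below (with made-up raw weights) applies softmax only to the retained prefix $W^{\mathrm{Raw}}_{s,0}, \ldots, W^{\mathrm{Raw}}_{s,l_s}$ and shows that each truncated weight vector remains non-negative and sums to one.

```python
import numpy as np

def truncated_weights(raw_weights, l_s):
    """Softmax over the retained prefix W^Raw_{s,0}, ..., W^Raw_{s,l_s} (Eq. (10))."""
    kept = np.asarray(raw_weights[: l_s + 1], dtype=float)
    exp = np.exp(kept - kept.max())     # subtract the max for numerical stability
    return exp / exp.sum()

raw = [0.2, -1.0, 0.7, 0.1, 1.3]        # hypothetical raw weights of a 4-unit stage
for l_s in range(len(raw)):
    w = truncated_weights(raw, l_s)
    print(l_s, np.round(w, 3), round(float(w.sum()), 6))  # every prefix sums to 1.0
```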
As shown in Algorithm 1, the forward process of a truncated supernet can be represented as a composition of several truncated stages as described above.
Algorithm 1 Supernet forward algorithm
Require: Input $x$; truncation lengths $\mathbf{L}$; network $N = \{\mathrm{Stem}, \{N_s\}, \mathrm{Head}\}$, where each stage $N_s$ contains a downsampler $\mathrm{Downsampler}_s$, weight parameters $W_s^{\mathrm{Raw}} = \{W_{s,l}^{\mathrm{Raw}}\}_{l=0,\ldots,L_s}$, and nonlinear transformations $N_s^{\mathrm{Raw}} = \{N_{s,l}^{\mathrm{Raw}}\}_{l=1,\ldots,L_s}$
Ensure: $y$
 1: $x_1 \leftarrow \mathrm{Stem}(x)$
 2: for $s = 1$ to $\mathrm{len}(\mathbf{L})$ do
 3:     $l_s \leftarrow \mathbf{L}[s]$
 4:     $W_s[0{:}l_s] \leftarrow \mathrm{Softmax}(W_s^{\mathrm{Raw}}[0{:}l_s])$
 5:     $y_{s,0} \leftarrow \mathrm{Downsampler}_s(x_s)$
 6:     $x_{s,1} \leftarrow y_{s,0}$
 7:     for $l = 1$ to $l_s$ do
 8:         $y_{s,l} \leftarrow N_{s,l}^{\mathrm{Raw}}(x_{s,l})$
 9:         $x_{s,l+1} \leftarrow y_{s,l} + x_{s,l}$
10:     end for
11:     $x_{s+1} \leftarrow \sum_{l=0}^{l_s} W_{s,l} \cdot y_{s,l}$
12: end for
13: $y \leftarrow \mathrm{Head}(x_{S+1})$
14: return $y$
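The following PyTorch sketch is one possible translation of Algorithm 1. The module and variable names (`SupernetStage`, `mlp_block`, the linear stem and head) and the toy dimensions are placeholders of our own choosing, not the authors' ConvNeXt-based implementation; only the control flow, per-stage truncation, prefix softmax, residual accumulation, and weighted aggregation follow the algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupernetStage(nn.Module):
    """One reinterpreted stage: downsampler, L_s residual units, raw weights."""
    def __init__(self, downsampler, blocks):
        super().__init__()
        self.downsampler = downsampler
        self.blocks = nn.ModuleList(blocks)                       # nonlinear transforms N^Raw
        self.raw_w = nn.Parameter(torch.zeros(len(blocks) + 1))   # W^Raw_{s,0..L_s}

    def forward(self, x, l_s):
        w = F.softmax(self.raw_w[: l_s + 1], dim=0)               # Eq. (10)
        y0 = self.downsampler(x)
        outputs, cur = [y0], y0
        for block in self.blocks[:l_s]:                           # truncated residual path
            y = block(cur)
            outputs.append(y)
            cur = cur + y
        return sum(w_l * y_l for w_l, y_l in zip(w, outputs))     # Eq. (11)

class Supernet(nn.Module):
    def __init__(self, stem, stages, head):
        super().__init__()
        self.stem, self.head = stem, head
        self.stages = nn.ModuleList(stages)

    def forward(self, x, truncation):                             # truncation: list of l_s
        x = self.stem(x)
        for stage, l_s in zip(self.stages, truncation):
            x = stage(x, l_s)
        return self.head(x)

# Toy instantiation on flat features, just to exercise the control flow.
def mlp_block(dim):
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

stages = [SupernetStage(nn.Linear(32, 32), [mlp_block(32) for _ in range(4)])
          for _ in range(2)]
net = Supernet(nn.Linear(16, 32), stages, nn.Linear(32, 10))
print(net(torch.randn(8, 16), truncation=[2, 4]).shape)           # torch.Size([8, 10])
```

Passing a different `truncation` list selects a different prefix sub-network while reusing exactly the same parameters.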
The sub-network construction strategy can follow standard NAS- or pruning-based methods using architecture weights W . A simple instantiation used in our experiments is described in Section 4.1.3.
As a supplementary illustration to Algorithm 1, we present a concrete forward pass of a two-stage supernet under client-specific truncation.
Suppose Stage 1 contains two residual units, and Stage 2 contains ten. For a given client, the truncation policy selects $\mathbf{L} = [2, 1]$, meaning that both residual units are active in Stage 1, and only one is active in Stage 2.
The input $x$ is first processed by the Stem module to obtain $x_1$.
Stage 1. The input $x_1$ is passed through the downsampler as follows:
$$y_{1,0} = \mathrm{Downsampler}_1(x_1), \qquad x_{1,1} = y_{1,0}.$$
The raw weights $W_1^{\mathrm{Raw}} = [W_{1,0}^{\mathrm{Raw}}, W_{1,1}^{\mathrm{Raw}}, W_{1,2}^{\mathrm{Raw}}]$ are normalized via softmax as follows:
$$W_{1,l} = \frac{e^{W_{1,l}^{\mathrm{Raw}}}}{\sum_{j=0}^{2} e^{W_{1,j}^{\mathrm{Raw}}}}, \qquad l = 0, 1, 2.$$
Then the residual path is computed as follows:
$$y_{1,1} = N_{1,1}^{\mathrm{Raw}}(x_{1,1}), \qquad x_{1,2} = x_{1,1} + y_{1,1},$$
$$y_{1,2} = N_{1,2}^{\mathrm{Raw}}(x_{1,2}), \qquad x_{1,3} = x_{1,2} + y_{1,2}.$$
The output of Stage 1 is the weighted sum, as follows:
$$x_2 = W_{1,0} \cdot y_{1,0} + W_{1,1} \cdot y_{1,1} + W_{1,2} \cdot y_{1,2}.$$
Stage 2. The input $x_2$ is processed similarly. Since $l_2 = 1$, only one residual unit is used, as follows:
$$y_{2,0} = \mathrm{Downsampler}_2(x_2), \qquad x_{2,1} = y_{2,0},$$
$$W_{2,l} = \frac{e^{W_{2,l}^{\mathrm{Raw}}}}{\sum_{j=0}^{1} e^{W_{2,j}^{\mathrm{Raw}}}}, \qquad l = 0, 1,$$
$$y_{2,1} = N_{2,1}^{\mathrm{Raw}}(x_{2,1}), \qquad x_{2,2} = x_{2,1} + y_{2,1}.$$
The output of Stage 2 is as follows:
$$x_3 = W_{2,0} \cdot y_{2,0} + W_{2,1} \cdot y_{2,1}.$$
Finally, the Head module computes the network output as follows:
$$y = \mathrm{Head}(x_3).$$
This example demonstrates how a supernet dynamically executes operations with different truncation lengths per stage, computing outputs through residual updates and stage-wise weighted aggregation, as defined in Algorithm 1.

3.2. Training Pipeline

We recommend a two-phase pretraining process prior to the FL training phase. Then, we introduce a federated sub-network generation method tailored to our specific supernet. The proposed training pipeline is shown in Figure 3.
Our supernet is constructed by reinterpreting residual sequential networks, derived either from manually designed architectures or those obtained via NAS. As a result, various training methods originally developed for manually crafted networks, such as using Adam [36] as the optimizer, can be directly applied to our reinterpreted supernet. In standard training procedures, we optimize block weights W and feature weights P jointly, without distinguishing between them.
Although our supernet retains a residual sequential structure, we decouple the training of structural weights and formulate it as a separate GLM-based learning task for structural optimization. Therefore, most training strategies designed for gradient-based NAS, including the bi-level optimization method proposed in DARTS, are also applicable to our approach.
However, standard training methods do not explicitly optimize parameterized network structures. Traditional NAS approaches typically rely on a fixed set of candidate components, such as predefined connections or blocks, which do not align well with the sub-networks produced by our truncation-based method. To address this, we propose a new training strategy tailored for sequential supernets. In each iteration, a truncation length is randomly sampled for each stage to determine the exit point of the sub-network. This mechanism enables random sampling of sub-architectures and simulates the federated learning scenario where different clients operate with different model depths. We refer to this strategy as Single-Path Random Quit (SPRQ), as it randomly selects an exit point in each stage.
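A minimal sketch of one SPRQ iteration under these assumptions: a uniformly random exit point is drawn per stage, and only the resulting prefix sub-network is trained on the current mini-batch. The `net(x, truncation)` interface matches the supernet sketch given after Algorithm 1, and the sampling range is our reading of the description rather than the authors' exact schedule.

```python
import random
import torch.nn.functional as F

def sprq_step(net, optimizer, batch, stage_depths):
    """One SPRQ iteration: sample a random exit point per stage, then train the
    resulting prefix sub-network on the current mini-batch."""
    x, y = batch
    truncation = [random.randint(1, depth) for depth in stage_depths]  # l_s in [1, L_s]
    optimizer.zero_grad()
    loss = F.cross_entropy(net(x, truncation), y)
    loss.backward()      # only parameters on the sampled prefix receive gradients
    optimizer.step()
    return truncation, loss.item()
```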
From a theoretical perspective, we establish that SPRQ does not alter the optimization target in expectation. We introduce binary mask variables $m_{t,(s,l)}$ to indicate whether the parameter group $\mathcal{P}_{s,l} = \{P_{s,l}, W_{s,l}\}$ is active under a randomly truncated path, where $p_{s,l}$ denotes the probability that this group is retained, and we rescale the corresponding gradients as $g_{t,(s,l)} = \frac{m_{t,(s,l)}}{p_{s,l}} \cdot g_{(s,l)}^{\mathrm{full}}$. This ensures that the gradient estimator remains unbiased, that is,
$$\mathbb{E}[g_{t,(s,l)}] = g_{(s,l)}^{\mathrm{full}}.$$
Thus, the stochastic sampling process introduces no bias in gradient estimation.
The per-iteration parameter update of SPRQ is expressed as follows:
$$P_{t+1} = P_t - \mathrm{diag}(\tilde{\eta}_t)\, \nabla_P L,$$
where $\tilde{\eta}_{t,(s,l)} = \eta_t \cdot \frac{m_{t,(s,l)}}{p_{s,l}}$ denotes the effective learning rate used to update each parameter in response to the loss $L$. Since $\mathbb{E}[\tilde{\eta}_t] = \eta_t \cdot \mathbf{1}$, we have
$$\mathbb{E}[P_{t+1}] = P_t - \mathbb{E}[\mathrm{diag}(\tilde{\eta}_t)\, \nabla_P L] = P_t - \eta_t \cdot \nabla_P L,$$
which, in expectation, matches the behavior of SGD as follows:
$$P_{t+1}^{\mathrm{SGD}} = P_t - \eta_t \cdot \nabla_P L.$$
Under the standard Robbins–Monro conditions $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$, and with the masks $\{m_{t,(s,l)}\}$ being independent of the mini-batch noise, the Robbins–Monro theorem [37] implies that $P_t$ almost surely converges to a stationary point of $L$.
In fact, Equation (23) corresponds to SGD with per-parameter learning rates, whose convergence has already been extensively studied [38,39,40,41].
Therefore, SPRQ maintains the convergence characteristics of unbiased stochastic gradient methods while offering practical scalability.
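The unbiasedness argument can be illustrated numerically. The NumPy sketch below draws random retention masks (independent Bernoulli masks here purely for simplicity; the argument only requires $\mathbb{E}[m_{t,(s,l)}] = p_{s,l}$) and shows that the inverse-probability rescaled gradients average back to the full gradients; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
g_full = np.array([0.5, -1.2, 3.0, 0.1])   # hypothetical full gradients of four units
p = np.array([0.9, 0.7, 0.5, 0.25])        # retention probability of each unit

# Monte Carlo estimate of E[(m / p) * g_full] over random retention masks.
masks = rng.random((200_000, 4)) < p        # m_{t,(s,l)} with E[m] = p
g_hat = (masks / p * g_full).mean(axis=0)
print(np.round(g_hat, 3))                   # close to g_full: the estimator is unbiased
```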
Although this method can produce strong pretrained parameters suitable for device-heterogeneous federated learning, the initial training phase may face difficulties in early convergence due to the dual randomness of parameters and structures. Therefore, we suggest replacing SPRQ with specialized NAS optimization methods during the warm-up stage.
In the federated learning stage, based on the characteristics of our network architecture, the sub-networks provided to different clients only differ in truncation lengths, while the model parameters, including structural and weight parameters, are completely consistent. This allows us to directly adopt most common federated learning methods without modifying the parameter aggregation approach, such as ring-based federated learning or FedAvg architectures based on parameter transmission. Even if the model sizes of different clients have quantitative differences, the above methods still apply.
In practical applications, we need to replace the detection Head and Stem parts of the reinterpreted models to adapt to the distribution differences and category number differences between pretraining datasets and federated datasets. After completing this adjustment, we only need to select the maximum network that each client can afford, or send the complete supernet to clients and let them decide the length of each stage. In this way, each client can use the strongest model they can support while helping improve the overall model performance. When aggregating client model parameters, based on the early exit characteristics of our network, there is no need to distinguish between different sizes; simply synchronizing the iterated parameters or accumulated gradients in sub-networks to the global model is sufficient.
In centralized federated topologies, communication efficiency can be maintained by aggregating only the subset of parameters relevant to each local model during each round of training, resulting in a communication cost comparable to that of standard FL approaches. In contrast, decentralized federated topologies inherently require more extensive parameter exchange, often involving the entire supernet. Nevertheless, owing to the compact design of our supernet relative to conventional supernet architectures, the associated communication overhead is significantly reduced.

4. Experiments

In this section, we design two sets of experiments to evaluate the impact of different components of the approach. Our method primarily consists of two parts, the Supernet-Only FL framework and a complementary pretraining approach.

4.1. Evaluation of Supernet-Only Framework

We use ConvNeXt Small [34] as the backbone model and design our supernet based on it. ConvNeXt Small consists of four stages; the first, second, and fourth stages have three layers each, while the third stage contains 27 layers. All stages are modified according to the proposed method. However, due to the large size of Stage 3, sub-network pruning is only applied to this stage, while other stages remain unpruned during the FL phase.
To ensure fairness in comparison, a unified pretraining process is conducted for all federated methods in this part of the experiment.
The pretraining lasts for 5 epochs with a learning rate of $1 \times 10^{-5}$ as warm-up, followed by 20 epochs with a learning rate of $1 \times 10^{-4}$. The batch size is set to 96, and the Nadam optimizer is used with default settings.
During the pretraining phase, all supernet parameters are fully optimized. In the federated training phase, the first and the last convolutional layers are replaced before federated transfer learning. After training, the performance of all client models is evaluated on the validation set to verify the effectiveness of the method. Although ConvNeXt Small is used as the example model, our method can be generalized to any deep neural network (DNN) [42] with skip connections, such as ResNet [43], ConvNeXt [34], or Transformer [44].
We categorize existing NAS and Fed-NAS strategies under resource-constrained conditions as follows:
  • Max Trainable: The largest sub-network that all clients can train is used as the federated model, equivalent to traditional FL.
  • Max Runnable: The largest sub-network that all clients can use for inference is selected as the federated model. Clients unable to train it only perform inference, while capable clients participate in training.
  • No Federated: Each client uses the strongest sub-network it can support without FL.
  • Grouped: Clients are divided into three different scale categories, with each group using the largest sub-network it can support for federated learning.
  • Peaches: The cells of the sub-network are separated into private cells and public ones; only the public cells are shared across the federated model.
In addition, we conduct comparative experiments under the distribution strategy of AdaptiveFL to validate the effectiveness of the Supernet-Only framework against dynamic network strategies for FL. Max Trainable and Max Runnable represent cases of lightweight global FL, No Federated represents device-aware NAS for personalized search, and Grouped represents clustering methods for Fed-NAS problems. These control groups aim to show the trade-off between using stronger models and sharing parameters across a larger range without our novel method.
Our method has high adaptability, so we test it under the following two standard federated learning structures: ring-based federated learning and the FedAvg framework.
We assume that the computational capabilities of clients are uniformly distributed between the smallest and largest sizes of residual units in the backbone network. The weakest clients can support sub-networks with all variable-sized units set to 1, while the strongest clients can handle the full backbone as it appears before transformation in Section 3.1. For simplicity, we also assume the training cost is twice the inference cost.

4.1.1. Evaluation of the Supernet-Only Framework Under Ring-Based Federated Learning

We first evaluate the Supernet-Only framework using ring-based FL [12] under simulated resource-constrained conditions. The supernet is pretrained on ImageNet1K [8] using the method recommended in Section 3.2. Federated datasets are divided into fourteen parts to simulate private datasets owned by fourteen clients with different performance levels. The variable module counts allowed for each client simulate the computational constraints. At the start of the federated learning phase, the first client of each federated cluster obtains the model from a provider. Each client then trains the model on its private data and passes the updated model to the next client, until the last client passes the model back to the first one; the cycle then repeats, simulating ring-based FL.
Federated training is conducted for 20 epochs with a batch size of 64, using a learning rate of $1 \times 10^{-4}$ and the Nadam optimizer with default settings. Following the related work, we select the public datasets CIFAR-100, CIFAR-10 [9], and CINIC-10 [10] for evaluation. CIFAR-10 and CIFAR-100 each provide 50,000 training images and 10,000 validation images, while CINIC-10 provides 90,000 images for training and 90,000 for validation. To simulate online learning, the random generator is not turned off during inference, leading to some performance fluctuations in the Max Trainable and Max Runnable groups. After the completion of the federated learning process, the mean and variance of the accuracy for all clients on the validation set are as shown in Table 1. The mean reflects the federated training performance of the model on heterogeneous clients under various data conditions. The variance of accuracy among all clients indicates the variability in performance within the group.
As shown in Table 1, our method achieved the best performance across all datasets, benefiting from its enhanced federated training capabilities. In the experiments on CIFAR-100, which has only 600 samples per class and presents greater challenges for model generalization, our method demonstrated the most significant improvement. However, in the federated training on CIFAR-10 and CINIC-10, the gaps between our method and the Max Trainable group are narrow, whereas the performance gaps between our method and Max Runnable are significant. Across all datasets, Peaches, along with both the No Federated and Grouped settings, exhibited inferior overall performance due to their limited ability to share desensitized information. Furthermore, the No Federated group, which had the least sharing, performed the worst, despite each client being able to use the strongest model it could afford.

4.1.2. Evaluation of the Supernet-Only Framework Under Federated Averaging (FedAvg)

We then evaluate the Supernet-Only framework using FedAvg [14] under evaluation settings similar to those of the ring-based federated learning experiments. FedAvg is the most commonly used FL algorithm for training a global model across distributed clients without centralizing data. The supernet is pretrained, and we distribute pruned sub-networks to clients as needed. At the end of each federated training round, the server aggregates the model by averaging the parameter updates from clients. In experiments with smaller models, the parameter update process is relatively simple. However, in the Supernet-Only framework, client models of different sizes share all parameters. To address this, only the parts used by each client are aggregated. For weights used by $c$ clients in a certain layer, server-side parameter updates are averaged across these $c$ clients. After the completion of the federated learning process, the mean and variance of the accuracy for all clients on the validation set are as shown in Table 2.
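The aggregation rule described above can be sketched as follows: for each parameter tensor, the server averages updates only over the $c$ clients whose truncated sub-network actually contains it, leaving unused parameters unchanged for that round. Client updates are modeled here as plain state dictionaries; this is an illustration, not the exact implementation.

```python
import torch

def aggregate_partial(global_state, client_states):
    """Average each parameter only over the clients that hold it.

    global_state: dict name -> tensor (the full supernet parameters).
    client_states: list of dicts; each client reports only the parameters
    contained in its truncated sub-network.
    """
    new_state = {}
    for name, value in global_state.items():
        updates = [cs[name] for cs in client_states if name in cs]
        if updates:                                   # used by c >= 1 clients this round
            new_state[name] = torch.stack(updates).mean(dim=0)
        else:                                         # used by no client this round
            new_state[name] = value.clone()
    return new_state
```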
As shown in Table 2, our method achieved promising results. In the experiments on CIFAR-100, our method outperforms the second-best Max Runnable group by more than 5%. In the FedAvg experiments on other datasets, the results differed significantly from those observed in ring-based federated learning. Specifically, for CIFAR-10 and CINIC-10, our method showed a narrow disadvantage compared to the Max Runnable group, while Max Trainable struggled due to overfitting and constraints imposed by the limited model size. On these two datasets, where data are relatively abundant, the advantage of the shared knowledge of Supernet-Only may not be fully realized.
To validate that our method maintains competitive performance and does not degrade relative to other methods when the federated scale varies, we performed FedAvg experiments on federated clusters consisting of 128 clients. In order to simulate a larger federation of 128 clients, we employed devices with higher GPU memory capacity, which consequently introduced some changes to the computational environment. The other experimental settings are the same as those described above in Section 4.1.2. After the completion of the federated learning process, the mean and variance of the accuracy for all clients on the validation set are shown in Table 3.
As the scale of the federation increases, the convergence behavior of federated learning approaches that of SGD, and the model's federated performance surpasses that obtained with moderate-scale federations. At the same time, the No Federated method, which does not involve federation, performs poorly.
To further evaluate the reliability and significance of the observed performance differences, we applied the Wilcoxon signed-rank test to compare Supernet-Only with two competitive baselines, Max Runnable and Max Trainable, using client-wise paired results. This non-parametric test is suitable for comparing matched samples without assuming a specific distribution. In the experiment, the test accuracy obtained under Supernet-Only was paired with the corresponding value from each baseline for the same client. Since only 79 out of 128 clients completed training in the Max Runnable setting due to resource limitations, this comparison was limited to the 79 matched clients. In contrast, the comparison involving Max Trainable used results from all 128 clients.
For each dataset and baseline, we calculated the Wilcoxon test statistic $W$, which is the sum of the ranks of the signed differences in performance. We also reported the standardized score $Z$, which indicates the magnitude and direction of the difference, the raw p-value, and the Bonferroni-corrected p-value for controlling the family-wise error rate across multiple tests. The raw p-value reflects the probability of observing the given result under the null hypothesis in a single test, while the corrected p-value adjusts for multiple comparisons to reduce the likelihood of Type I errors. In addition, we computed the effect size $r = Z/\sqrt{N}$, where $N$ represents the number of clients included in the comparison. The effect size describes the practical relevance of the result; values of $r$ greater than 0.3 typically indicate a large effect. These indicators together provide a comprehensive assessment of both the statistical and practical significance of the observed differences, as shown in Table 4.
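The reported statistics can be computed along the following lines. The sketch uses `scipy.stats.wilcoxon` for the W statistic and raw p-value, derives the normal-approximation Z and the effect size $r = Z/\sqrt{N}$ manually (without tie correction), and applies a Bonferroni factor; the client accuracies and the number of tests are synthetic placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

def wilcoxon_report(acc_a, acc_b, num_tests=6):
    """Paired Wilcoxon signed-rank test between two methods' client accuracies."""
    acc_a, acc_b = np.asarray(acc_a), np.asarray(acc_b)
    res = stats.wilcoxon(acc_a, acc_b)            # W statistic and raw p-value
    d = acc_a - acc_b
    n = np.count_nonzero(d)                       # pairs with a non-zero difference
    mu, sigma = n * (n + 1) / 4, np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (res.statistic - mu) / sigma              # normal approximation
    return {
        "W": res.statistic,
        "Z": z,
        "p": res.pvalue,
        "p_bonferroni": min(1.0, res.pvalue * num_tests),
        "effect_size_r": abs(z) / np.sqrt(len(acc_a)),
    }

# Synthetic example with 79 matched clients.
rng = np.random.default_rng(1)
ours = rng.normal(0.78, 0.02, size=79)
base = ours - rng.normal(0.05, 0.01, size=79)
print(wilcoxon_report(ours, base))
```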
The results indicate that Supernet-Only consistently and significantly outperforms Max Runnable on all of the following three datasets: CIFAR-10, CIFAR-100, and CINIC-10. The Bonferroni-corrected p-values in these comparisons are all below 0.001, with a value of $2.17 \times 10^{-13}$ on CIFAR-10, which reflects strong statistical evidence. The corresponding effect sizes are large, with $r$ values ranging from 0.78 to 0.86. These results show that the observed improvements are not only statistically significant but also of substantial practical importance at the client level.
When compared with Max Trainable, Supernet-Only also demonstrates statistically significant advantages across all datasets. Although the differences in this case are smaller than those observed against Max Runnable, they remain meaningful. All corrected p-values are below 0.01, such as $1.89 \times 10^{-3}$ on CIFAR-10. The effect sizes range from 0.45 to 0.67, which corresponds to medium-to-large practical effects. These results confirm that Supernet-Only offers both wider applicability across clients and consistently higher predictive performance.
Reviewing Table 1, Table 2 and Table 3, across all federation sizes and federated frameworks, our method consistently outperforms the second-best method by more than 1% on the CIFAR-100 dataset, and this margin exceeds 5% when using FedAvg, as shown in Table 2 and Table 3. Experimental results on the CIFAR-10 and CINIC-10 datasets demonstrate that our approach consistently ranks among the top performers. Compared to other state-of-the-art methods, our approach either maintains a certain advantage or shows only a minimal disadvantage, regardless of the performance changes in these methods. This indicates that our method is highly competitive and robust.

4.1.3. Evaluation of the Supernet-Only Framework Under FedAvg with Small Backbone

To validate the performance differences between our method and the use of dynamic networks, we implemented the Supernet-Only framework following the latest AdaptiveFL approach. The AdaptiveFL report [11] utilizes two smaller networks: VGG, without residual connections, and ResNet18, with residual connections. Dynamic neural networks are developed based on these models to meet the requirements of AdaptiveFL. Since VGG is an early architecture lacking residual connections, modifications can only be applied to ResNet18 to satisfy the criteria for the network reinterpretation defined in Section 3.1.
To enhance the flexibility of sub-network generation, we decompose the original residual blocks in ResNet18 such that each individual Conv-BN-ReLU layer forms a separate residual unit, resulting in 16 residual units in total. Based on the architecture weights of each stage in the Supernet, we apply softmax normalization and rank the four candidate operations within each stage. We then generate a series of sub-networks by progressively reducing the number of retained operations per stage from four to one, following the ranked priorities. We adopt the federated training setting from AdaptiveFL while performing our experiments based on the client distribution described in Section 4.1. In our setting, the average client size is significantly smaller, and the size difference among clients is substantially greater compared to the report of AdaptiveFL. After training through the federated phase, the accuracy achieved by different methods across various datasets is summarized in Table 5. Table 6 provides a detailed comparison between the largest-sized network and the strongest network from AdaptiveFL and the Supernet-Only framework without pretraining.
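The sub-network generation procedure described above can be sketched as follows: within each stage, the candidate units are ranked by their softmax-normalized architecture weights, and sub-networks are produced by retaining the top-$k$ units for $k = 4$ down to $1$. The raw weights below are hypothetical, and the handling of the identity-path weight $W_{s,0}$ is simplified.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def ranked_subnetworks(stage_raw_weights, max_keep=4):
    """For each stage, rank units by softmax weight and emit keep-sets of size max_keep..1."""
    configs = []
    for keep in range(max_keep, 0, -1):
        config = []
        for raw in stage_raw_weights:
            order = np.argsort(-softmax(np.asarray(raw, dtype=float)))  # highest weight first
            config.append(sorted(order[:keep].tolist()))                # retained unit indices
        configs.append(config)
    return configs

# Hypothetical raw architecture weights for the four stages of the decomposed ResNet18.
stage_weights = [[0.3, 0.1, -0.2, 0.5], [0.0, 0.4, 0.2, -0.1],
                 [0.6, -0.3, 0.1, 0.2], [0.2, 0.2, -0.4, 0.1]]
for cfg in ranked_subnetworks(stage_weights):
    print(cfg)
```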
In AdaptiveFL, networks of different sizes are separately created, which leads to convergence issues if they are pretrained. In contrast, our sub-networks are generated by a supernet, enabling our method to benefit from pretraining. Even without the unique advantage of pretraining, our method still demonstrates superior performance, benefiting from module selection and optimization guided by structure weights W .
In AdaptiveFL, as the difference in model size is significant, the performance degradation of the full-sized network compared to the best-performing pruned network exceeds 10%. In comparison, our method effectively minimizes the discrepancy between the best-performing sub-network and the largest network, maintaining stable performance. Once pretraining is activated, the sub-network with the largest size always outperforms the smaller ones within the Supernet-Only framework.
Overall, we observe that as the generalization difficulty of the training setting increases, the advantage of our method becomes more pronounced. Additionally, clustering federated devices into multiple federated subgroups does not appear to yield superior results compared to lightweight model selection. However, it is important to note that our backbone model, ConvNeXt, is inherently strong. In our experiments, model representational capacity may not be significantly constrained, which worked to the advantage of both the Max Runnable and Max Trainable groups.

4.2. Evaluation of the Training Pipeline

In Section 3.2, we propose a recommended pretraining routine for FL. In this section, we validate the effectiveness of our proposed pretraining method. We evaluate three different training methods during the warm-up and pretraining phases, including the standard training method used in conventional neural network training, the two-step optimization method proposed in DARTS, and our proposed SPRQ method that simulates federated client group behavior.
  • Normal: In this set, we pretrain the supernet using conventional network training methods and common optimizers, treating architecture parameters as regular network parameters.
  • DARTS: In this set, we use the two-step optimization method from DARTS [7], alternately training weights and architecture parameters on training and validation datasets.
  • SPRQ: In this set, we generate sub-networks by randomly skipping layers during pretraining to simulate client diversity and computational constraints (detailed in Section 3.2).
Next, we evaluate their respective performance in the pretraining and warm-up phases to verify the necessity of using the DARTS warm-up followed by the SPRQ pretraining process.

4.2.1. Evaluation of the Main Pretraining Phase

In Section 3.2, we recommend the SPRQ method for the main pretraining phase. Fixing the warm-up training method as DARTS, we used SPRQ, DARTS, and standard optimization methods during the pretraining phase. Pretraining and warm-up were conducted on the ImageNet1k dataset, with the supernet’s loss and accuracy on the validation set during pretraining shown in Figure 4a,b. Subsequently, FL was performed on the CIFAR-100, CIFAR-10, and CINIC-10 datasets, and the accuracy of different clients was evaluated using the respective validation sets. The mean and standard deviation of client accuracies are shown as error bars in Figure 4c,d.
As shown in Figure 4a,b, during the pretraining phase, SPRQ did not outperform DARTS or Normal in terms of loss and accuracy. The Normal group showed the best convergence ability and final precision. However, despite the additional randomness introduced, SPRQ demonstrated convergence comparable to DARTS and Normal and achieved the highest average accuracy across all clients with the smallest standard deviation in FL, as shown in Figure 4c,d. Conversely, the Normal training method performed poorly during FL. The Normal group had the lowest average client performance and the largest performance standard deviation.
Furthermore, as shown in Table 7, SPRQ consumed significantly less time during pretraining due to frequent early exits.

4.2.2. Evaluation of Warm-Up Phase

In Section 3.2, we suggest using DARTS for warm-up training to overcome the challenges of initial SPRQ training. To validate its effectiveness, we pretrained the network on the ImageNet1K dataset with different warm-up methods and then performed pretraining using the SPRQ method on the same dataset. Loss and accuracy on the ImageNet1K validation set during the pretraining phase are shown in Figure 5a,b. Subsequently, FL was conducted with the same setup as in the evaluation of the main pretraining phase, and the federated average accuracy and standard deviation under different datasets and methods are shown in Figure 5c,d.
As shown in Figure 5a,b, the DARTS warm-up method outperformed Normal and SPRQ in terms of convergence speed and final performance. The error bars in Figure 5c,d demonstrate that DARTS provided the best average performance across all clients, with the lowest standard deviation in validation accuracy. This validates our recommendation to use a DARTS-based training method during the warm-up phase.

5. Discussion

Existing Fed-NAS research faces the challenge of balancing comprehensive data anonymization and sharing with the full utilization of client performance.
We observe that the output of sequential networks composed of residual units with skip connections can be represented as a weighted sum of the outputs from each nonlinear transformation. This allows us to reinterpret any network composed of residual units into a NAS task by making these learnable weights independent as a meta-learning task. Based on this insight, we introduce a new methodology for converting any residual-based sequential neural network into a NAS task, enabling flexible and efficient architecture search. Even though standard NAS and non-NAS methods can be employed to train the transformed supernet, its particular structure led us to discover a more fitting training strategy, referred to as Single-Path Random Quit (SPRQ). Our method facilitates the extensive sharing of anonymized data while maximizing the performance of each client, considering their varying computational capabilities. This effectively eliminates the previously frustrating trade-off, as shown in Table 8.
Extensive experiments on CIFAR-10, CIFAR-100, and CINIC-10 demonstrate the effectiveness of the proposed Supernet-Only framework for federated learning. Under both ring-based and FedAvg training schemes, the framework generally achieves superior or at least comparable performance to strong baselines across diverse and resource-constrained client settings. On CIFAR-100, the Supernet-Only framework improves average accuracy by more than 5% over the Max Trainable baseline in the FedAvg setup. Statistical analysis using the Wilcoxon signed-rank test confirms that these gains are significant at the client level, with consistently low p-values and large effect sizes. In large-scale federated learning with 128 clients, the framework maintains strong performance, reaching an average accuracy of 77.84% on CIFAR-100 and outperforming the next-best method by over 5%. On CIFAR-10 and CINIC-10, while performance is sometimes comparable to Max Runnable, Supernet-Only remains among the top-performing methods in all cases. When applied to smaller backbones such as ResNet-18 under the AdaptiveFL configuration, it yields accuracy improvements of up to 18.3%. Additionally, the proposed training pipeline, which combines DARTS-based warm-up with SPRQ-based pretraining, achieves the best overall federated performance with reduced variance and cuts pretraining time by approximately 24% compared to other methods. These findings indicate that the Supernet-Only framework offers a practical, scalable, and reliable solution for federated learning in computationally heterogeneous environments.
Despite the promising results, our study has several limitations. First, the experiments are conducted on commonly-used public benchmarks such as CIFAR-10, CIFAR-100, and CINIC-10. While these datasets facilitate comparability with prior work, they do not fully capture the complexity, heterogeneity, and privacy challenges of real-world federated learning applications, such as those in healthcare, mobile devices, or industrial systems. Second, our current framework assumes a stable communication environment with synchronous client participation. It does not consider real-world factors such as limited bandwidth, intermittent connectivity, or client dropouts, which can significantly impact performance in practical federated learning scenarios. Third, our reinterpretation of residual networks as supernets under the neural architecture search paradigm is, to the best of our knowledge, the first of its kind. However, this exploration remains preliminary and limited in scope. Further investigation is needed to fully understand the expressiveness, generalization behavior, and potential drawbacks of this formulation. Finally, all evaluations are conducted in simulated environments. The absence of deployment on actual mobile or edge devices means that important system-level aspects such as latency, memory usage, and energy consumption have not yet been evaluated.
Future work will focus on addressing these limitations to improve the real-world applicability of our approach. Specifically, we plan to evaluate our method using more realistic federated datasets that better reflect cross-device and cross-silo environments, such as MedMNIST [45], OpenFL benchmarks [46], and other LEAF benchmarks [47], which capture non-IID characteristics, device heterogeneity, and practical deployment constraints. We also intend to extend the framework to handle communication-limited and unreliable scenarios, including asynchronous client updates, adaptive participation, and the mitigation of straggler clients. Additionally, we aim to further explore the reinterpretation of residual networks as supernets by exploring more expressive weighting mechanisms, analyzing their convergence behavior in heterogeneous conditions, and reintegrating them with advanced neural architecture search strategies. Finally, we will investigate the deployment of our method on actual hardware platforms such as smartphones, edge devices, or IoT systems in order to assess system-level performance, resource consumption, and feasibility in real-world applications.

6. Conclusions

In this paper, we propose the Supernet-Only framework, which reinterprets residual sequential networks as supernets in a NAS task by expressing their outputs as weighted sums of nonlinear transformations, where the weights are optimized through a meta-learning process. By introducing stage-wise truncation and softmax-normalized architectural weights, our framework enables dynamic sub-network generation based on client-specific constraints, allowing modern multi-stage residual architectures to be efficiently adapted to federated learning environments. We design a training pipeline consisting of a DARTS-based warm-up phase, SPRQ-based pretraining, and federated fine-tuning with sub-network truncation, enabling efficient supernet training across heterogeneous clients while preserving compatibility with standard federated learning protocols.
The experimental results confirm that the proposed Supernet-Only framework achieves consistently strong performance across diverse federated learning scenarios and datasets, outperforming existing baselines in both large-scale and small-backbone settings while benefiting from an effective DARTS-SPRQ pretraining pipeline.
However, limitations remain, such as reliance on standard benchmark datasets, assumptions of stable communication, and the lack of deployment on real-world devices. Future work will focus on addressing these limitations by introducing realistic federated datasets, supporting asynchronous and unreliable environments, enhancing the generalization of the supernet formulation, and evaluating the system-level performance on real hardware.

Author Contributions

Conceptualization, Y.C. and D.C.; methodology, Y.C.; software, Y.C.; validation, Y.C.; formal analysis, Y.C., D.C. and C.Z.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, Y.C., D.C. and C.Z.; visualization, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Natural Science Foundation of Guangxi under Grant No. 2025GXNSFAA069540 and the Research Capacity Improvement Project of Young Researcher under Grant No. 2024KY0017.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available datasets: Imagenet in [8] at (https://www.image-net.org/, accessed on 13 May 2025), CINIC-10 in [10] at (https://github.com/BayesWatch/cinic-10, accessed on 13 May 2025), and CIFAR-10, CIFAR-100 in [9] at (https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 13 May 2025).

Acknowledgments

We would like to express our gratitude to the School of Computer, Electronics and Information, Guangxi University; the Guangxi Universities Key Laboratory of Parallel Distributed and Intelligent Computing; and the High Performance Computing Platform of Guangxi University for their resources and support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IoT: Internet of Things
IID: Independent and Identically Distributed
FL: Federated Learning
non-IID: non-Independent and Identically Distributed
NAS: Neural Architecture Search
Fed-NAS: Federated Neural Architecture Search
SPRQ: Single-Path Random Quit
DARTS: Differentiable ARchiTecture Search

References

  1. Aouedi, O.; Vu, T.H.; Sacco, A.; Nguyen, D.C.; Piamrat, K.; Marchetto, G.; Pham, Q.V. A survey on intelligent Internet of Things: Applications, security, privacy, and future directions. IEEE Commun. Surv. Tutor. 2024, 27, 1238–1292. [Google Scholar] [CrossRef]
  2. Wen, J.; Zhang, Z.; Lan, Y.; Cui, Z.; Cai, J.; Zhang, W. A survey on federated learning: Challenges and applications. Int. J. Mach. Learn. Cybern. 2023, 14, 513–535. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, H.I.; Galindo, M.; Xie, H.; Wong, L.K.; Shuai, H.H.; Li, Y.H.; Cheng, W.H. Lightweight Deep Learning for Resource-Constrained Environments: A Survey. ACM Comput. Surv. 2024, 56, 267. [Google Scholar] [CrossRef]
  4. Heuillet, A.; Nasser, A.; Arioui, H.; Tabia, H. Efficient automation of neural network design: A survey on differentiable neural architecture search. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
  5. Zhu, H.; Zhang, H.; Jin, Y. From federated learning to federated neural architecture search: A survey. Complex Intell. Syst. 2021, 7, 639–657. [Google Scholar] [CrossRef]
  6. Xu, G.; Wang, X.; Wu, X.; Leng, X.; Xu, Y. Development of residual learning in deep neural networks for computer vision: A survey. Eng. Appl. Artif. Intell. 2025, 142, 109890. [Google Scholar] [CrossRef]
  7. Liu, H.; Simonyan, K.; Yang, Y. Darts: Differentiable architecture search. arXiv 2018, arXiv:1806.09055. [Google Scholar]
  8. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  9. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  10. Darlow, L.N.; Crowley, E.J.; Antoniou, A.; Storkey, A.J. Cinic-10 is not imagenet or cifar-10. arXiv 2018, arXiv:1810.03505. [Google Scholar]
  11. Jia, C.; Hu, M.; Chen, Z.; Yang, Y.; Xie, X.; Liu, Y.; Chen, M. AdaptiveFL: Adaptive heterogeneous federated learning for resource-constrained AIoT systems. In Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 23–27 June 2024; pp. 1–6. [Google Scholar]
  12. Lee, J.W.; Oh, J.; Lim, S.; Yun, S.Y.; Lee, J.G. Tornadoaggregate: Accurate and scalable federated learning via the ring-based architecture. arXiv 2020, arXiv:2012.03214. [Google Scholar]
  13. Konečnỳ, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
  14. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  15. Diao, E.; Ding, J.; Tarokh, V. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. arXiv 2020, arXiv:2010.01264. [Google Scholar]
  16. Kim, M.; Yu, S.; Kim, S.; Moon, S.M. Depthfl: Depthwise federated learning for heterogeneous clients. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  17. Ilhan, F.; Su, G.; Liu, L. Scalefl: Resource-adaptive federated learning with heterogeneous clients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24532–24541. [Google Scholar]
  18. Han, Y.; Huang, G.; Song, S.; Yang, L.; Wang, H.; Wang, Y. Dynamic Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7436–7456. [Google Scholar] [CrossRef]
  19. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  20. Howard, A.G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  21. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10734–10742. [Google Scholar]
  22. Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; Sun, J. Single path one-shot neural architecture search with uniform sampling. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 544–560. [Google Scholar]
  23. Khan, S.; Rizwan, A.; Khan, A.N.; Ali, M.; Ahmed, R.; Kim, D.H. A multi-perspective revisit to the optimization methods of Neural Architecture Search and Hyper-parameter optimization for non-federated and federated learning environments. Comput. Electr. Eng. 2023, 110, 108867. [Google Scholar] [CrossRef]
  24. Pan, Z.; Hu, L.; Tang, W.; Li, J.; He, Y.; Liu, Z. Privacy-Preserving Multi-Granular Federated Neural Architecture Search—A General Framework. IEEE Trans. Knowl. Data Eng. 2021, 35, 2975–2986. [Google Scholar] [CrossRef]
  25. Yuan, J.; Xu, M.; Zhao, Y.; Bian, K.; Huang, G.; Liu, X.; Wang, S. Federated neural architecture search. arXiv 2020, arXiv:2002.06352. [Google Scholar]
  26. Laskaridis, S.; Fernandez-Marques, J.; Dudziak, Ł. Cross-device Federated Architecture Search. In Proceedings of the Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), New Orleans, LA, USA, 2 December 2022. [Google Scholar]
  27. Liu, J.; Yan, J.; Xu, H.; Wang, Z.; Huang, J.; Xu, Y. Finch: Enhancing federated learning with hierarchical neural architecture search. IEEE Trans. Mob. Comput. 2023, 23, 6012–6026. [Google Scholar] [CrossRef]
  28. Yu, S.; Muñoz, J.P.; Jannesari, A. Resource-Aware Heterogeneous Federated Learning with Specialized Local Models. In Proceedings of the European Conference on Parallel Processing, Madrid, Spain, 26–30 August 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 389–403. [Google Scholar]
  29. Khare, A.; Agrawal, A.; Annavajjala, A.; Behnam, P.; Lee, M.; Latapie, H.; Tumanov, A. SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-device Inference. In Proceedings of the European Conference on Computer Vision, Shanghai, China, 1–2 January 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 161–179. [Google Scholar]
  30. Hoang, M.; Kingsford, C. Personalized neural architecture search for federated learning. In Proceedings of the 1st NeurIPS Workshop on New Frontiers in Federated Learning (NFFL 2021), Virtual, 13 December 2021. [Google Scholar]
  31. Yan, J.; Liu, J.; Xu, H.; Wang, Z.; Qiao, C. Peaches: Personalized federated learning with neural architecture search in edge computing. IEEE Trans. Mob. Comput. 2024, 23, 10296–10312. [Google Scholar] [CrossRef]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 630–645. [Google Scholar]
  33. Lin, H.; Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; Volume 31. [Google Scholar]
  34. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  36. Laskaridis, S.; Kouris, A.; Lane, N.D. Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, EMDL’21, New York, NY, USA, 25 June 2021; pp. 1–6. [Google Scholar] [CrossRef]
  37. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar]
  38. Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  39. Rattray, M.; Saad, D.; Amari, S.i. Natural gradient descent for on-line learning. Phys. Rev. Lett. 1998, 81, 5461. [Google Scholar] [CrossRef]
  40. Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  41. Loizou, N.; Vaswani, S.; Hadj Laradji, I.; Lacoste-Julien, S. Stochastic Polyak Step-Size for SGD: An Adaptive Learning Rate for Fast Convergence. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; Volume 130, pp. 1306–1314. [Google Scholar]
  42. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  45. Yang, J.; Shi, R.; Ni, B. MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis. In Proceedings of the IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 191–195. [Google Scholar]
  46. Foley, P.; Sheller, M.J.; Edwards, B.; Pati, S.; Riviera, W.; Sharma, M.; Moorthy, P.N.; Wang, S.H.; Martin, J.; Mirhaji, P.; et al. OpenFL: The open federated learning library. Phys. Med. Biol. 2022, 67, 214001. [Google Scholar] [CrossRef] [PubMed]
  47. Caldas, S.; Duddu, S.M.K.; Wu, P.; Li, T.; Konečnỳ, J.; McMahan, H.B.; Smith, V.; Talwalkar, A. Leaf: A benchmark for federated settings. arXiv 2018, arXiv:1812.01097. [Google Scholar]
Figure 1. Representation of the supernet stage structure: (a) Full stage structure, where each stage consists of a downsampler followed by a sequence of residual layers (L1, L2, and L3), and all outputs are aggregated into the stage output. (b) Truncation mechanism used to generate sub-networks, where once a layer (e.g., L3) is skipped, all subsequent layers within the same stage become uncomputable and are excluded from the network path. This design enforces a strict sequential dependency within each stage, aligning the structure with a truncatable residual supernet. The dotted lines represent edges and nodes that are not used during the current computation.
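As a small illustration of the truncation rule in Figure 1b, the sketch below maps a client's relative capacity to a per-stage truncation plan. The helper name and the proportional capacity model are hypothetical and may differ from the constraint model actually used for sub-network generation (e.g., FLOP- or memory-based budgets).

```python
from typing import List

def truncation_plan(stage_depths: List[int], capacity: float) -> List[int]:
    """Hypothetical helper: turn a relative client capacity in (0, 1] into
    per-stage truncation lengths. Because skipping a layer makes every later
    layer in the same stage uncomputable, keeping the first l_s layers of each
    stage fully describes a sub-network.
    """
    return [max(0, min(depth, round(capacity * depth))) for depth in stage_depths]

# Example with a four-stage backbone of two residual layers per stage.
print(truncation_plan([2, 2, 2, 2], capacity=1.0))  # [2, 2, 2, 2] -> full supernet
print(truncation_plan([2, 2, 2, 2], capacity=0.5))  # [1, 1, 1, 1] -> half-depth sub-network
```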
Figure 2. Computation flow within stage $N_s$ of the supernet: The inputs are the feature map $x_s$ (from the previous stage), the truncation length $l_s$ (determined by client-side constraints), and the raw weights $W_s$ (i.e., $W_s^{\mathrm{Raw}}$). $D_s$, $S$, and $O_{s,l}$ denote the downsampler $\mathrm{Downsampler}_s$, the weight normalization function Softmax, and the nonlinear transformation $N_{s,l}^{\mathrm{Raw}}$, respectively. The dashed box highlights a single residual unit, repeating $l_s$ times. This stage corresponds to lines 4–11 in Algorithm 1.
Figure 3. Training pipeline overview: The process consists of the following three main stages: pretraining, federated training, and client-level validation. In the pretraining stage (left), the supernet is first warmed up using the DARTS method, which alternates between architecture and weight updates. It is then further trained using the SPRQ strategy, which applies random layer skipping to simulate client diversity and improve generalization. To accommodate the input/output differences between pretraining and federated tasks, the Stem and Head modules are replaced before federated training, and each client receives a subset of the supernet for local training on its own data. Finally, in the validation stage (right), the trained models are evaluated individually on each client’s local validation set to assess performance in heterogeneous settings.
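The SPRQ stage of this pipeline can be summarized by the training step sketched below, in which a single random truncation plan is sampled per batch so that each update trains one sub-network that some client could later run. The function signature and the assumption that the model accepts a truncation plan (as in the stage sketch given earlier) are illustrative; the actual sampling distribution and optimizer settings follow the experimental setup rather than this sketch.

```python
import random

def sprq_step(model, batch, optimizer, criterion, stage_depths):
    """One Single-Path Random Quit (SPRQ) pretraining step (illustrative).

    A random truncation length is drawn independently for every stage, and only
    that sub-network is executed, so the cost of each update is bounded by the
    sampled path rather than the full supernet. At least one layer per stage is
    kept here, which is an assumption of this sketch.
    """
    images, labels = batch
    plan = [random.randint(1, depth) for depth in stage_depths]  # "random quit" points
    optimizer.zero_grad()
    loss = criterion(model(images, plan), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```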
Figure 4. Evaluation results of the pretraining phase: (a) Validation loss curves on the ImageNet1k dataset across training epochs during the main pretraining phase, where DARTS warm-up is applied and three pretraining methods are compared. The x-axis represents training epochs, and the y-axis shows validation loss. (b) The corresponding validation accuracy curves, where the x-axis represents training epochs and the y-axis shows validation accuracy. (c,d) The mean and standard deviation of client accuracies on CIFAR-100, CINIC-10, and CIFAR-10, using FedSGD and FedAvg, respectively. Markers and colors indicate the methods used: blue dots for DARTS, green triangles for SPRQ, and orange squares for the Standard method.
Figure 5. Evaluation results of the warm-up phase: (a) Loss curves on the ImageNet1k dataset across training epochs during the main pretraining phase, with different warm-up strategies applied beforehand. The x-axis represents training epochs, and the y-axis shows validation loss. (b) The corresponding validation accuracy curves, where the x-axis indicates epochs and the y-axis indicates validation accuracy. (c,d) The mean and standard deviation of client accuracies on CIFAR-100, CINIC-10, and CIFAR-10 using FedSGD and FedAvg, respectively. Markers and colors indicate the warm-up strategies used: blue dots for DARTS, green triangles for SPRQ, and orange squares for Normal.
Table 1. Accuracy statistic under ring-based FL: Every value illustrates the mean validation accuracy and the corresponding standard deviation across different clients for the corresponding dataset and federated learning setting (the best and second-best results are in bold and underlined, respectively).
Method           CIFAR-100        CIFAR-10         CINIC-10
Supernet-Only    78.57 ± 1.78     95.12 ± 0.84     88.09 ± 0.81
Max Trainable    77.28 ± 0.23     94.93 ± 0.19     88.07 ± 0.15
Max Runnable     76.61 ± 0.19     93.89 ± 0.20     86.82 ± 0.25
No Federated     56.01 ± 1.66     88.95 ± 0.63     79.60 ± 0.71
Grouped          67.54 ± 1.67     91.49 ± 0.70     83.46 ± 0.45
Peaches          63.64 ± 16.68    85.17 ± 19.67    78.70 ± 17.60
Table 2. Accuracy statistic under FedAvg: Every value illustrates the mean validation accuracies and their corresponding standard deviations across different clients (the best and second-best results are in bold and underlined, respectively).
Method           CIFAR-100        CIFAR-10         CINIC-10
Supernet-Only    67.67 ± 1.43     91.07 ± 0.61     83.22 ± 0.62
Max Trainable    58.69 ± 0.43     88.87 ± 0.37     81.12 ± 0.27
Max Runnable     62.31 ± 0.45     91.43 ± 0.37     83.30 ± 0.30
No Federated     54.35 ± 0.90     88.37 ± 1.14     79.89 ± 0.60
Grouped          59.22 ± 1.89     90.09 ± 0.76     81.93 ± 0.96
Peaches          43.37 ± 1.37     72.28 ± 2.58     72.37 ± 0.94
Table 3. Accuracy statistic under FedAvg for the large-scale federated setting: Every value illustrates the mean validation accuracies and their corresponding standard deviations across 128 clients. This experiment was trained for 50 epochs using an H20-96G GPU, whereas other experiments under the ConvNeXt configuration were conducted for 20 epochs on a 4090-24G GPU, due to the increased memory requirements from large-scale client simulation and the slower convergence rate in this setting (the best and second-best results are in bold and underlined, respectively).
Method           CIFAR-100        CIFAR-10         CINIC-10
Supernet-Only    77.84 ± 0.91     94.70 ± 0.62     87.12 ± 0.88
Max Trainable    70.88 ± 0.89     92.72 ± 0.59     84.13 ± 0.54
Max Runnable     72.42 ± 0.86     93.41 ± 0.52     85.75 ± 0.67
No Federated     13.78 ± 8.08     69.49 ± 12.00    63.44 ± 6.64
Grouped          71.45 ± 1.00     93.38 ± 0.53     83.95 ± 1.01
Peaches          57.43 ± 10.18    85.79 ± 15.58    76.89 ± 3.31
Table 4. Wilcoxon signed-rank test results for client-level accuracy comparisons: This table reports statistical test results comparing the performance of our proposed method (Supernet-Only) against two baseline strategies (Max Runnable and Max Trainable) across three datasets. Reported metrics include the Wilcoxon test statistic W, standardized statistic Z, raw p-value, Bonferroni-corrected p-value, and effect size r. Significant differences (p < 0.01) and medium-to-large effect sizes (r ≥ 0.45) indicate meaningful performance differences at the client level.
Dataset      Compared to      W       Z        Raw p           Corr. p         r
CIFAR-10     Max Runnable     30      −7.57    1.08 × 10⁻¹³    2.17 × 10⁻¹³    0.85
CIFAR-10     Max Trainable    1986    −5.09    9.43 × 10⁻⁴     1.89 × 10⁻³     0.45
CIFAR-100    Max Runnable     159     −6.94    1.41 × 10⁻¹¹    2.82 × 10⁻¹¹    0.78
CIFAR-100    Max Trainable    959     −7.54    9.35 × 10⁻¹³    1.87 × 10⁻¹²    0.67
CINIC-10     Max Runnable     8       −7.68    3.30 × 10⁻¹⁴    6.61 × 10⁻¹⁴    0.86
CINIC-10     Max Trainable    920     −7.63    2.54 × 10⁻¹¹    5.08 × 10⁻¹¹    0.67
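For reference, a client-level analysis of this kind can be carried out with a short script along the following lines. The sketch assumes SciPy's default handling of tied clients and the usual normal approximation Z = (W - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24) with effect size r = |Z| / sqrt(n); it is illustrative rather than a verbatim copy of the analysis code used for Table 4.

```python
import numpy as np
from scipy.stats import wilcoxon

def client_level_test(acc_ours, acc_baseline, n_comparisons=2):
    """Wilcoxon signed-rank comparison of per-client accuracies (sketch).

    Returns the test statistic W, the normal-approximation Z, the raw and
    Bonferroni-corrected p-values, and the effect size r = |Z| / sqrt(N),
    where N counts clients whose paired accuracies differ.
    """
    a = np.asarray(acc_ours, dtype=float)
    b = np.asarray(acc_baseline, dtype=float)
    w, p_raw = wilcoxon(a, b)

    n = int(np.count_nonzero(a - b))            # tied clients are discarded
    mean_w = n * (n + 1) / 4.0
    sd_w = np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean_w) / sd_w
    r = abs(z) / np.sqrt(n)

    p_corr = min(1.0, p_raw * n_comparisons)    # Bonferroni over the two baseline comparisons
    return {"W": w, "Z": z, "raw_p": p_raw, "corr_p": p_corr, "r": r}
```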
Table 5. Accuracy statistic under the AdaptiveFL and Supernet-Only frameworks with or without pretraining. Every value illustrates the mean validation accuracies and their corresponding standard deviations across 13 model sizes and 100 clients under the federated setting of AdaptiveFL (the best and second-best results are in bold and underlined, respectively).
Method                            CIFAR-100        CIFAR-10         CINIC-10
Supernet-Only (pretrained)        57.76 ± 7.11     87.78 ± 5.76     79.22 ± 6.14
Supernet-Only (not pretrained)    49.56 ± 12.31    77.55 ± 18.16    73.89 ± 7.28
AdaptiveFL                        31.26 ± 19.39    67.77 ± 25.36    57.92 ± 21.68
Table 6. Comparison of performance gaps between the largest and best structures. Comparisons of the full-size and best-performing models in AdaptiveFL with max and best sub-networks in the Supernet-Only framework without pretraining. Full and max refer to the largest network configurations (both matching ResNet-18), while best indicates the sub-network or pruned network achieving the highest test accuracy.
Dataset      AdaptiveFL (Best)    AdaptiveFL (Full)    Supernet-Only (Best)    Supernet-Only (Max)
CIFAR-100    53.75                14.03                56.91                   53.34
CIFAR-10     87.36                76.64                88.81                   88.64
CINIC-10     74.72                59.15                78.05                   75.97
Table 7. Pretraining time (20 epochs, hours). Each value represents the time cost statistics during the 20 epochs of the pretraining phase on an RTX 4090 GPU.
Pretraining Method    DARTS    Standard    SPRQ
Time Spent (h)        25.40    25.66       19.34
Table 8. Comparison of Fed-NAS methods. This table compares representative Fed-NAS methods in terms of their ability to maximize federation, computation utilization, and model sharing. Maximizing Federation refers to all clients forming a single federation. Maximizing Computation means each client can select the best-performing sub-network according to its own computational capacity. Full Sharing indicates that all parameters used by clients are included in the federated synchronization. A ✔ indicates that the method supports the corresponding feature.
Method           Maximize Federation    Maximize Computation    Full Sharing
Supernet-Only
Max Trainable
Max Runnable
No Federated
Grouped
Peaches
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
