Article

FedRegNAS: Regime-Aware Federated Neural Architecture Search for Privacy-Preserving Stock Price Forecasting

1 School of Business, Suzhou University of Science and Technology, Suzhou 215000, China
2 School of Art, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(24), 4902; https://doi.org/10.3390/electronics14244902
Submission received: 11 November 2025 / Revised: 8 December 2025 / Accepted: 9 December 2025 / Published: 12 December 2025
(This article belongs to the Special Issue Security and Privacy in Distributed Machine Learning)

Abstract

Financial time series are heterogeneous, nonstationary, and dispersed across institutions that cannot share raw data. While federated learning enables collaborative modeling under privacy constraints, fixed architectures struggle to accommodate cross-market drift and device-resource diversity; conversely, existing neural architecture search techniques presume centralized data and typically ignore communication, latency, and privacy budgets. This paper introduces FedRegNAS, a regime-aware federated NAS framework that jointly optimizes forecasting accuracy, communication cost, and on-device latency under user-level ( ε , δ ) -differential privacy. FedRegNAS trains a shared temporal supernet composed of candidate operators (dilated temporal convolutions, gated recurrent units, and attention blocks) with regime-conditioned gating and lightweight market-aware personalization. Clients perform differentiable architecture updates locally via Gumbel-Softmax and mirror descent; the server aggregates architecture distributions through Dirichlet barycenters with participation-weighted trust, while model weights are combined by adaptive, staleness-robust federated averaging. A risk-sensitive objective emphasizes downside errors and integrates transaction-cost-aware profit terms. We further inject calibrated noise into architecture gradients to decouple privacy leakage from weight updates and schedule search-to-train phases to reduce communication. Across three real-world equity datasets, FedRegNAS improves directional accuracy by 3–7 percentage points and Sharpe ratio by 18–32%. Ablations highlight the importance of regime gating and barycentric aggregation, and analyses outline convergence of the architecture mirror-descent under standard smoothness assumptions. FedRegNAS yields adaptive, privacy-aware architectures that translate into materially better trading-relevant forecasts without centralizing data.

1. Introduction

Financial time series are noisy, heavy-tailed, and regime-dependent. The most informative records are siloed across institutions that cannot share raw data because of regulation and competitive sensitivity. Federated learning (FL) offers a way to collaborate without centralizing data [1,2], yet fixed neural architectures often underperform under cross-market heterogeneity, device-resource diversity, and asynchronous participation. Neural architecture search (NAS) can adapt models through differentiable relaxations [3] and parameter-sharing or stochastic variants [4,5,6], but most NAS assumes centralized data and ignores communication, latency, and user-level privacy. Differential privacy (DP) provides principled protection [7,8], although naively adding noise to all channels can severely hurt accuracy, especially with non-IID clients and partial participation. In financial prediction, modern sequence models—temporal convolutions [9], attention-based forecasters [10], probabilistic RNNs [11], and microstructure-aware CNNs [12]—must also handle latent regime shifts [13] and data-dependent gating effects [14]. This work targets the intersection of these challenges for stock-return forecasting, aligning with the special-issue theme of privacy-preserving, resource-aware learning in finance.
In real FL deployments, institutions differ in universe coverage, liquidity regimes, feature engineering, and hardware capabilities. This heterogeneity leads to several issues: (i) architecture mismatch, where a single backbone cannot serve all clients well; (ii) staleness and partial participation, which bias aggregated updates and increase variance; (iii) tight resource envelopes, where communication budgets and on-device latency caps must be respected; and (iv) privacy constraints, where user-level DP is required not only on weights but also on architecture signals that can leak sensitive patterns. Financial objectives also extend beyond mean squared error to directional accuracy and trading-aware risk metrics under frictions. These factors call for a federated method that can search over architectures, adapt to regimes, control resources, and provide calibrated privacy guarantees [15,16,17].
We introduce FedRegNAS, a regime-aware federated NAS framework that learns both network weights θ and architecture logits α from decentralized data while respecting communication and latency proxies and enforcing user-level DP. The core model is a temporal supernet composed of dilated temporal convolutions, gated recurrent units, and causal attention blocks. A lightweight gate g ψ produces a regime simplex z t that modulates per-operator adapters, so the supernet can react to market conditions without duplicating the entire network. Clients update θ with stochastic gradients and optimize α via mirror descent on the logit parameterization using a Gumbel–Softmax relaxation. Their updates are clipped and noised with decoupled DP budgets for weights and architectures (ADDP, Adaptive Decoupled Differential Privacy). On the server, we aggregate weights using staleness- and trust-aware coefficients and fuse client architecture beliefs through a KL barycenter B , a logarithmic opinion–style rule robust to non-IID participation [18]. A search-to-train curriculum anneals the sampling temperature, discretizes α to a sparse subgraph, and fine-tunes θ before deploying compact market-aware adapters. Resource use is managed via differentiable proxies for communication and on-device latency, which are integrated into the training objective.
FedRegNAS is built around four principles: (P1) adaptivity to regimes through z t -conditioned operators that preserve parameter efficiency; (P2) stability under heterogeneity via KL-barycentric architecture fusion and similarity-based trust weighting of weight updates; (P3) privacy without over-noising by decoupling DP budgets for θ and α using Rényi composition; and (P4) resource-awareness that co-optimizes accuracy with communication and latency using lazy architecture synchronization, quantization, and sparsification compatible with clipping-first DP.
On federated daily and intraday equity benchmarks, FedRegNAS consistently improves over strong FL baselines. Relative to a Transformer trained with FedProx, we observe 3–5% lower RMSE, gains of 1.6–2.3 percentage points in directional accuracy, and roughly 10–12% higher Sharpe ratio across datasets. At similar accuracy, mean client upload is reduced by about 43% versus a federated LSTM and by about 70% versus the Transformer baseline. The discovered architectures also cut median on-device inference latency by roughly 47% (for example, 3.6 ms vs. 6.8 ms on a mobile-class CPU). With user-level DP around $(\varepsilon_w, \varepsilon_\alpha) \approx (2.0, 1.0)$ at $\delta = 10^{-5}$, decoupled budgets preserve accuracy relative to a single coupled budget, supporting the view that architecture-channel noise is the dominant privacy lever.
Contributions. (1) A regime-gated federated supernet for financial forecasting that conditions operator adapters on latent regimes and searches over temporal primitives suited to markets [9,10,11]; (2) a barycentric architecture aggregation rule that combines client architecture distributions via a KL barycenter, improving stability and sparsity preservation under non-IID data [18]; (3) ADDP, a decoupled user-level DP mechanism that separately allocates noise to weights and architectures, analyzed with RDP accounting [7,8]; (4) an integrated risk- and resource-aware objective that trades off forecasting error, trading utility, and communication/latency; (5) a practical search-to-train curriculum with personalization that yields compact, low-latency deployments and strong out-of-sample performance. Together, these elements enable adaptive, privacy-aware forecasting under realistic federated constraints.
Organization. Section 2 reviews federated optimization, privacy, NAS, and finance-specific forecasting. Section 3 introduces notation, objectives, privacy accounting, and resource models. Section 4 details the regime-gated supernet, mirror-descent search, KL-barycentric aggregation, ADDP, and the communication stack. Section 5 presents the theoretical analysis, covering convergence and privacy accounting. Section 6 reports accuracy, trading, efficiency, ablations, and privacy–performance trade-offs. Section 7 concludes.

2. Related Work

Our work lies at the intersection of federated optimization, differential privacy, neural architecture search (NAS), and financial time-series forecasting. We highlight the most relevant lines and position FedRegNAS among them.

2.1. Federated Learning Under Heterogeneity, Staleness, and Robustness

Federated averaging [1,19] established on-device collaborative training with periodic model aggregation. Later work improved stability under client drift by adding proximal regularization and control variates, as in FedProx [20], SCAFFOLD [21], and adaptive server optimizers [22]. Surveys summarize open problems around non-IID data, partial participation, and systems constraints [2]. Asynchronous updates and staleness are addressed by staleness-aware weighting [23], while robustness to adversarial or noisy clients motivates aggregators like Krum [24], Byzantine-robust estimation [25], and trust bootstrapping [26]. Communication efficiency is pursued via quantization and sparsification, e.g., QSGD [27], deep gradient compression [28], and memory-based top-k sparsified SGD [29]. Our server-side weighting combines staleness penalties with similarity-based trust, and our communication stack composes clipping, quantization, and sparsification in a way that remains compatible with user-level privacy guarantees.

2.2. Differential Privacy in Deep and Federated Learning

Differentially private SGD (DP-SGD) introduced per-iteration clipping, Gaussian noise, and advanced accounting for deep learning [7]. Rényi differential privacy (RDP) [8] enables tighter composition bounds and convenient subsampling analysis, and is now standard for user-level guarantees in FL. We follow this line but decouple privacy budgets across weight and architecture channels. More noise is assigned to the more sensitive architecture gradients, while weight updates use moderated noise to preserve accuracy. This complements existing DP-FL mechanisms that rely on clipping-and-noise and RDP-style accountants [7,8].

2.3. Neural Architecture Search and Differentiable Relaxations

NAS has evolved from reinforcement learning and evolutionary strategies to more efficient one-shot relaxations and gradient-based search [30]. ENAS shares parameters across architectures to reduce search cost [4]. DARTS relaxes discrete choices to a continuous simplex and uses gradient descent [3]. Stochastic NAS variants rely on the Gumbel–Softmax/Concrete distribution for differentiable sampling [5,31,32]. Hardware- and resource-aware methods bias search toward efficient operators or prune expensive ones [6]. Most of this literature assumes centralized data and ignores privacy, communication, and device heterogeneity. In contrast, our federated, regime-aware supernet, KL-style barycentric aggregation (akin to logarithmic opinion pools [18]), and decoupled DP budgets explicitly link architecture decisions to client regimes, resource constraints, and privacy.

2.4. Time-Series Forecasting and Financial Prediction

Deep sequence models such as temporal convolutional networks [9], temporal fusion transformers [10], and probabilistic forecasters like DeepAR [11] have improved generic time-series prediction. In high-frequency finance, specialized architectures exploit limit-order-book structure [12]. Market nonstationarity is often modeled via regime switching [13] or mixtures of experts with data-dependent gating [14]. Our regime-gated, federated NAS framework unifies these ideas by searching for operator choices and adapters conditioned on latent regimes, while optimizing a risk-sensitive objective that is tied to trading performance.
Prior work provides essential building blocks: federated optimization [1,22], privacy accounting [7,8], differentiable NAS [3,6,31], and regime-aware sequence modeling [13,14]. FedRegNAS integrates these strands and adds three main novelties: (i) a regime-gated federated supernet tailored to financial series; (ii) a KL-inspired barycentric aggregation of client architecture distributions, related to log-opinion pooling [18]; and (iii) architecture/weight DP decoupling with resource-aware curricula. Together, these components enable adaptive, privacy-preserving forecasting under realistic communication budgets.

2.5. Federated NAS and DP-Aware NAS

Neural architecture search has also been integrated with federated learning. Survey work on federated NAS [33] and specific algorithms such as FedNAS [34], DP-FNAS [35], and personalized federated NAS like PerFedRLNAS [36] combine architecture search with client-side training. These frameworks mostly target vision benchmarks, operate under generic non-IID partitions, and search over convolutional backbones or layer configurations. They typically optimize accuracy and, in some cases, resource usage. In contrast, FedRegNAS is designed for stock-return forecasting. It operates on a regime-gated temporal supernet, searches over time-series operators and horizon-specific modules, and optimizes an objective that jointly reflects forecasting performance, communication and latency proxies, and decoupled user-level DP budgets [37]. This design allows us to study the interaction between NAS and FL in a financial setting, rather than on image benchmarks, and to analyze how architecture search behaves under decoupled DP.
Our framework is also related to architecture-level personalization methods that adapt model structures to heterogeneous clients. Hypernetwork-based FL approaches, such as the filter-aware personalization framework of Yang et al. [38], generate client-specific filters or blocks from a metanetwork on top of a shared backbone. Both this line of work and FedRegNAS pursue personalized architectures, but they differ in mechanism. Hypernetwork methods keep a fixed global architecture and personalize through generated weights. FedRegNAS instead performs federated NAS over a regime-aware temporal supernet, aggregates probabilistic architecture distributions across clients, and then applies lightweight adapters for personalization. Beyond NAS, personalized FL frameworks in medical imaging, such as SCAN-PhysFed for low-dose CT denoising with large language models [39], show the benefit of incorporating domain structure and auxiliary signals into FL. Our method follows this principle in the financial domain by using latent regimes, financial objectives, and DP constraints to guide architecture search and aggregation.

3. Preliminaries

We formalize the forecasting task, modeling assumptions, federated optimization protocol, and the privacy/resource/evaluation models used throughout. The notation introduced here is reused consistently in subsequent sections.

3.1. Data, Predictive Task, and Regime Representation

We consider N institutions (clients) indexed by $i \in \{1,\dots,N\}$. Client i holds a private dataset
$$\mathcal{D}_i = \big\{ \big(x^{(i)}_{t-L+1:t},\, y^{(i)}_t\big) \big\}_{t \in \mathcal{T}_i},$$
where $x^{(i)}_{t-L+1:t} \in \mathbb{R}^{L \times d}$ denotes a lagged feature window and $y^{(i)}_t \in \mathbb{R}$ is the next-period log-return. With price $p^{(i)}_t$, define $r^{(i)}_t = \log\big(p^{(i)}_t / p^{(i)}_{t-1}\big)$ and the prediction target $y^{(i)}_t = r^{(i)}_{t+1}$. Features may include technical indicators, realized volatility, order-book statistics, and cross-asset signals; feature construction is client-specific and remains on-device.
In addition, we summarize the latent market state by a regime vector $z_t \in \Delta^{R}$ over R regimes (e.g., low-volatility, high-volatility, event-driven). A lightweight gate $g_\psi : \mathbb{R}^{L \times d} \to \Delta^{R}$ estimates $z_t = g_\psi(x_{t-L+1:t})$. Unless otherwise specified, regime-related parameters (e.g., per-regime adapters) are included in the global weight vector $\theta$.

3.2. Supernet Parameterization and Architecture Relaxation

The global predictor is a parametric function $f_{\theta,\alpha}$ that maps $x^{(i)}_{t-L+1:t}$ to $\hat{y}^{(i)}_t$, where $\theta$ are network weights and $\alpha$ are architecture logits. We adopt a one-shot supernet with M decision points; at decision m, the candidate operator set is $\mathcal{O}_m = \{O_{m,j}\}_{j=1}^{J_m}$ (e.g., dilated temporal convolutions, GRU cells, causal self-attention, identity). For each module m, we maintain logits $\alpha_m \in \mathbb{R}^{J_m}$ and define the corresponding operator probabilities via a softmax
$$p_m = \mathrm{softmax}(\alpha_m) \in \Delta^{J_m},$$
so that $p_{m,j}$ is the probability weight assigned to operator $O_{m,j}$. Stacking yields $\alpha = (\alpha_1,\dots,\alpha_M)$ for logits and $p = (p_1,\dots,p_M)$ for probabilities.
During search, mixed operations use a Gumbel–Softmax relaxation over the logits:
$$\mathrm{MixOp}_m(h) = \sum_{j=1}^{J_m} z_{m,j}\, O_{m,j}(h), \qquad z_m = \mathrm{softmax}\!\left(\frac{\alpha_m + g_m}{\tau}\right),$$
where $g_{m,j} \overset{\mathrm{iid}}{\sim} \mathrm{Gumbel}(0,1)$ and $\tau > 0$ is a temperature. Post-search, the learned probabilities $p_m$ are discretized (e.g., top-1 or top-s per decision) and the resulting subnetwork is fine-tuned.
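To make the relaxation concrete, the following PyTorch-style sketch shows a mixed operation that samples operator weights with the Gumbel–Softmax and returns the weighted combination of candidate outputs. The class name MixOp, the use of torch.nn.functional.gumbel_softmax, and the assumption that candidate operators are supplied as nn.Modules with matching output shapes are illustrative choices, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixOp(nn.Module):
    """Gumbel-Softmax relaxed mixture over candidate operators (illustrative sketch)."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                        # candidate operators O_{m,j}
        self.logits = nn.Parameter(torch.zeros(len(ops)))    # architecture logits alpha_m

    def forward(self, h, tau=1.0):
        # z_m = softmax((alpha_m + g_m) / tau), with g_m ~ Gumbel(0, 1)
        z = F.gumbel_softmax(self.logits, tau=tau, hard=False)
        # weighted sum of candidate operator outputs
        return sum(z[j] * op(h) for j, op in enumerate(self.ops))
```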

3.3. Federated Optimization and Architecture Fusion

Training proceeds over rounds $k = 0, 1, \dots, K-1$. At round k, the server samples $S_k \subseteq \{1,\dots,N\}$ and broadcasts $(\theta^k, \alpha^k)$. Each selected client performs E local steps on minibatches from $\mathcal{D}_i$, producing update deltas $(\Delta\theta_i, \Delta\alpha_i)$ and an effective weight $\nu_i$ (e.g., proportional to local sample count). Let $\tau_i^k \ge 0$ denote the staleness (round lag) of client i and $\phi(\cdot)$ a nonincreasing penalty (e.g., $\phi(\tau) = 1/(1+\tau)$). The server aggregates as
$$\theta^{k+1} = \theta^{k} + \sum_{i \in S_k} \omega_i^k\, \Delta\theta_i, \qquad \omega_i^k = \frac{\nu_i\, \phi(\tau_i^k)}{\sum_{j \in S_k} \nu_j\, \phi(\tau_j^k)},$$
$$\alpha^{k+1} = \mathcal{B}\big(\{\alpha_i\}_{i \in S_k},\, \{w_i^k\}\big), \qquad w_i^k \propto \nu_i\, \phi(\tau_i^k),$$
where (2) is a staleness-robust variant of federated averaging and (3) aggregates architecture distributions via a barycentric operator. Instead of directly optimizing architecture probabilities on the simplex, each client optimizes unconstrained logits and obtains probabilities via a softmax. For module m, let $p_m = \mathrm{softmax}(\alpha_m) \in \Delta^{J_m}$ denote the probability vector over its $J_m$ operators, where $\alpha_m \in \mathbb{R}^{J_m}$ are the corresponding logits. Mirror descent on the simplex with KL geometry is conveniently implemented by taking a gradient step in the logit space and then re-normalizing with a softmax. Concretely, at local step t on client i, we update
$$\alpha^{t+1}_{m,j} = \alpha^{t}_{m,j} - \eta_\alpha\, \widehat{\nabla_{\alpha_{m,j}} L_i}, \qquad p^{t+1}_m = \mathrm{softmax}\big(\alpha^{t+1}_m\big),$$
where $\widehat{\nabla L_i}$ is a minibatch gradient estimate and $\eta_\alpha > 0$ is the architecture stepsize. This update is equivalent to mirror descent with KL divergence on the probability simplex (as in exponentiated-gradient methods), but is implemented in practice by a simple gradient step on the logits followed by a softmax, which keeps $p_m$ on the simplex without manual renormalization. Here, $\alpha_m$ denotes the global architecture logits maintained on the server, while $\alpha^{(k)}_m$ denotes the corresponding client-specific logits updated locally on client k. We use $p_m = \mathrm{softmax}(\alpha_m)$ for the associated probabilities, and $\theta$ / $\theta^{(k)}$ for shared/local model weights, so that local and global variables are clearly distinguished in all subsequent equations. Given client distributions $\{\alpha^{(i)}\}$ and weights $\{w_i\}$ with $\sum_i w_i = 1$, we use the KL barycenter (geometric mean on the simplex):
$$\big[\mathcal{B}\big(\{\alpha^{(i)}\},\{w_i\}\big)\big]_{m,j} = \frac{\prod_i \big(\alpha^{(i)}_{m,j}\big)^{w_i}}{\sum_{\ell=1}^{J_m} \prod_i \big(\alpha^{(i)}_{m,\ell}\big)^{w_i}},$$
the solution of $\arg\min_{\alpha_m \in \Delta^{J_m}} \sum_i w_i\, D_{\mathrm{KL}}\big(\alpha_m \,\|\, \alpha^{(i)}_m\big)$, which preserves emerging sparsity and is robust to non-IID participation.
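As a minimal illustration of the barycentric rule in (5), the NumPy sketch below computes the weighted geometric mean of client operator distributions for a single decision point and renormalizes it onto the simplex; the function name and the small numerical stabilizer are our own assumptions.

```python
import numpy as np

def kl_barycenter(p_list, w):
    """Weighted KL barycenter (normalized geometric mean) of client operator
    distributions for one decision point, cf. Eq. (5).
    p_list: list of probability vectors of length J_m; w: weights summing to 1."""
    p = np.stack(p_list)                         # shape (num_clients, J_m)
    w = np.asarray(w)[:, None]
    log_bar = (w * np.log(p + 1e-12)).sum(0)     # weighted log geometric mean
    bar = np.exp(log_bar - log_bar.max())        # stabilize before normalization
    return bar / bar.sum()

# Example: two clients, three candidate operators
print(kl_barycenter([np.array([0.7, 0.2, 0.1]),
                     np.array([0.5, 0.3, 0.2])], [0.6, 0.4]))
```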

3.4. Privacy, Resource Proxies, and Evaluation Criteria

We enforce user-level $(\varepsilon, \delta)$-DP with Gaussian mechanisms on weights and architectures, allowing decoupled budgets $(\varepsilon_w, \delta_w)$ and $(\varepsilon_\alpha, \delta_\alpha)$. Per-client sanitization clips updates and adds noise:
$$\tilde{\Delta\theta}_i = \mathrm{clip}(\Delta\theta_i, C_w), \qquad \tilde{\Delta\alpha}_i = \mathrm{clip}(\Delta\alpha_i, C_\alpha),$$
$$\hat{\Delta\theta}_i = \tilde{\Delta\theta}_i + \mathcal{N}\big(0, \sigma_w^2 C_w^2 I\big), \qquad \hat{\Delta\alpha}_i = \tilde{\Delta\alpha}_i + \mathcal{N}\big(0, \sigma_\alpha^2 C_\alpha^2 I\big),$$
with noise multipliers $\sigma_w, \sigma_\alpha > 0$. Composition across rounds follows a moments or Rényi DP accountant; the server aggregates only sanitized updates via (2) and (5).
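A minimal sketch of the clipping-and-noise step above is given below, assuming flattened NumPy update vectors; the clipping norms and noise multipliers shown are illustrative values within the ranges reported later, not prescribed settings.

```python
import numpy as np

def sanitize(delta, clip_norm, noise_multiplier, rng=None):
    """Clip an update to L2 norm <= clip_norm and add Gaussian noise with
    standard deviation noise_multiplier * clip_norm (clipping-and-noise step above)."""
    rng = rng if rng is not None else np.random.default_rng()
    scale = min(1.0, clip_norm / (np.linalg.norm(delta) + 1e-12))
    return delta * scale + rng.normal(0.0, noise_multiplier * clip_norm, size=delta.shape)

# Decoupled channels (illustrative shapes and hyperparameters):
delta_theta = np.random.randn(1000)   # stand-in weight delta
delta_alpha = np.random.randn(64)     # stand-in architecture delta
delta_theta_hat = sanitize(delta_theta, clip_norm=1.0, noise_multiplier=0.75)
delta_alpha_hat = sanitize(delta_alpha, clip_norm=0.5, noise_multiplier=1.2)
```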
Let $\mathrm{bytes}(\theta, \alpha)$ denote the serialized client upload and $\mathrm{lat}(\alpha; \mathrm{HW})$ the on-device latency (predicted by additive operator latencies) for hardware profile $\mathrm{HW}$. A differentiable proxy controls resources:
$$\mathcal{R}(\alpha) = \lambda_{\mathrm{comm}} \cdot \mathbb{E}\big[\mathrm{bytes}(\theta, \alpha)\big] + \lambda_{\mathrm{lat}} \cdot \mathbb{E}_{\mathrm{HW}}\big[\mathrm{lat}(\alpha; \mathrm{HW})\big],$$
with nonnegative weights $\lambda_{\mathrm{comm}}, \lambda_{\mathrm{lat}}$. For prediction $\hat{y}_t = f_{\theta,\alpha}(x_{t-L+1:t}; z_t)$ and true $y_t$, define the asymmetric squared loss
$$\ell_{\mathrm{asym}}(y_t, \hat{y}_t) = \begin{cases} \lambda_-\,(y_t - \hat{y}_t)^2, & y_t - \hat{y}_t < 0, \\ \lambda_+\,(y_t - \hat{y}_t)^2, & y_t - \hat{y}_t \ge 0, \end{cases} \qquad \lambda_- > \lambda_+ > 0.$$
A threshold policy $\pi_\vartheta$ generates $u_t \in \{-1, 0, 1\}$; the per-period strategy return with transaction cost $\gamma \ge 0$ is
$$\tilde{r}_t = u_t\, y_t - \gamma\, |u_t - u_{t-1}|.$$
Client i minimizes
$$L_i(\theta, \alpha) = \mathbb{E}\big[\ell_{\mathrm{asym}}(y_t, \hat{y}_t)\big] - \lambda_{\mathrm{pnl}} \cdot \mathbb{E}[\tilde{r}_t] + \lambda_{\mathrm{cvar}} \cdot \mathrm{CVaR}_\beta(\tilde{r}_t) + \lambda_{\mathrm{res}} \cdot \mathcal{R}(\alpha),$$
with $\lambda_{\mathrm{pnl}}, \lambda_{\mathrm{cvar}}, \lambda_{\mathrm{res}} \ge 0$ and tail level $\beta \in (0, 1)$.
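The sketch below illustrates the loss ingredients, assuming NumPy arrays of realized and predicted returns and a position sequence; the weights lam_minus/lam_plus, the cost gamma, and the convention of applying CVaR to a generic loss array (e.g., negated strategy returns) are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def asym_loss(y, y_hat, lam_minus=2.0, lam_plus=1.0):
    """Asymmetric squared loss (Eq. (9)): downside errors (y_t - y_hat_t < 0) get weight lam_minus."""
    err = np.asarray(y) - np.asarray(y_hat)
    return np.where(err < 0, lam_minus, lam_plus) * err ** 2

def strategy_return(u, y, gamma=1e-4):
    """Per-period strategy return (Eq. (10)) with proportional transaction cost gamma."""
    u = np.asarray(u, dtype=float)
    u_prev = np.concatenate([[0.0], u[:-1]])     # previous position, flat before t = 1
    return u * np.asarray(y) - gamma * np.abs(u - u_prev)

def cvar(losses, beta=0.95):
    """Empirical CVaR at tail level beta: mean of the worst (1 - beta) fraction of losses."""
    losses = np.sort(np.asarray(losses))
    tail = losses[int(np.ceil(beta * len(losses))):]
    return tail.mean() if tail.size else losses[-1]
```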

4. Methodology

We present FedRegNAS, a regime-aware, privacy-preserving federated neural architecture search method for stock-return forecasting. The method couples a regime-gated temporal supernet with mirror-descent architecture updates, trust- and staleness-aware aggregation, and a search-to-train curriculum that respects device and communication budgets. We retain the notation introduced in Section 3 and reorganize the methodology into four coherent subsections.
In general, FedRegNAS combines three key components: (i) a regime-gated temporal supernet, (ii) KL-barycentric aggregation of client-wise architecture distributions, and (iii) adaptive decoupled differential privacy (ADDP) for weights and architectures. Regime gating reuses standard sequence operators (e.g., recurrent and attention blocks) but organizes them via a learned regime encoder so that different latent market regimes select different subpaths; this differs from prior federated NAS works that typically search over global vision backbones without regime structure. KL-barycentric aggregation adapts ideas from distributional averaging to architecture logits, replacing the simple parameter averaging used in conventional FL and federated NAS baselines, and is specifically designed to stabilize architecture search under non-IID clients. Finally, ADDP reinterprets user-level DP accounting in a decoupled manner, assigning separate noise scales and budgets to weight updates and architecture gradients, in contrast to earlier DP-aware NAS methods that typically use a single privacy mechanism for all parameters. The overall structure of FedRegNAS is shown in Figure 1.

4.1. Problem Formulation and Bilevel NAS Reduction

Let $L_i(\theta, \alpha)$ be the client loss in (11). The global objective aggregates client risks with sampling weights $\pi_i > 0$, $\sum_i \pi_i = 1$:
$$\min_{\theta,\, \alpha \in \Delta} \; F(\theta, \alpha) := \sum_{i=1}^{N} \pi_i\, \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\big[L_i(\theta, \alpha)\big],$$
optionally augmented by a resource regularizer (e.g., FLOPs/latency penalty) with coefficient λ res as in the smoothed objective F λ used in the analysis.
We adopt the standard one-shot NAS relaxation: during search, the network is overparameterized and ( θ , α ) are optimized jointly; after selection, α is discretized to a sparse architecture and θ is fine-tuned. Concretely, the architecture parameters α = { α m } m define simplex weights over J m candidate operators at decision m. Post-search, we discretize by top-1 (or top-s) selection using empirical usage frequencies of the soft selectors (cf. Section 4.4), and then continue federated training with α frozen.
Because architecture quality depends on weights trained on heterogeneous clients (and vice versa), we treat α as a slow variable and θ as a fast variable. Clients update θ every round via stochastic gradients of L i , while α is updated less frequently by mirror descent on the product of simplices (cf. (4)). This reduction stabilizes search under partial participation and reduces communication by transmitting architecture updates only intermittently (every H α rounds).

4.2. Regime-Gated Temporal Supernet

To embed nonstationarity, we modulate operators by a regime simplex z t = g ψ ( x t L + 1 : t ) Δ R , where g ψ is a locally computed encoder over the most recent feature window. The encoder’s output weights regime-specific adapters while keeping most parameters shared. We instantiate the regime encoder g ψ as a lightweight GRU-based temporal encoder followed by a linear projection and softmax, mapping past returns and features x 1 : t to a point z t on a simplex over R latent regimes. Unless otherwise stated, g ψ is shared across markets and trained end-to-end jointly with the forecasting model and architecture parameters, without using any explicit volatility or volume labels. The regime cardinality R is treated as a model-selection hyperparameter and chosen on a held-out validation set.
On this basis, for a hidden activation h at decision m, we define
$$\mathrm{RG\text{-}MixOp}_m\big(h;\, x_{t-L+1:t}\big) = \sum_{j=1}^{J_m} z_{m,j} \sum_{r=1}^{R} z_t[r]\; A_{m,j,r}\big(O_{m,j}(h)\big), \qquad z_m = \mathrm{softmax}\!\left(\frac{\log \alpha_m + g_m}{\tau}\right),$$
where $O_{m,j}$ is the j-th candidate operator, and $A_{m,j,r}(u) = \gamma_{m,j,r}\, u + \beta_{m,j,r}$ is a lightweight, per-regime affine adapter contributing to $\theta$. The temperature $\tau$ is annealed (high → low) during search to move from exploration to exploitation.
RG-MixOps enable conditional computation over latent regimes by combining two sets of weights: the time-varying regime vector $z_t \in \Delta^{R-1}$ produced by the regime encoder $g_\psi(x_{1:t})$, and the module-wise operator probabilities $z_m \in \Delta^{J_m-1}$ obtained from the Gumbel–Softmax over logits $\alpha_m$. For a given module m and hidden state h at time t, we first apply each primitive operator $O_{m,j}(h)$, transform it through regime-specific adapters $A_{m,j,r}(\cdot)$, and weight these outputs by $z_t[r]$; the resulting regime mixture is then combined across operators using $z_{m,j}$. Thus RG-MixOp outputs a doubly weighted sum over regimes r and operators j, with $z_t[r]$ gating $A_{m,j,r}$ inside each operator and $z_{m,j}$ mixing the $J_m$ operators. Since $z_t$ is computed locally on each client and never transmitted, regime states remain private by construction.
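For concreteness, a PyTorch-style sketch of the regime-gated mixed operation is given below. It assumes scalar per-(operator, regime) scale/shift adapters and a regime vector z_t supplied by the gate; the paper's adapters may be channel-wise, so this is a simplified illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGMixOp(nn.Module):
    """Regime-gated mixed operation: doubly weighted sum over operators j and regimes r."""
    def __init__(self, ops, num_regimes):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.logits = nn.Parameter(torch.zeros(len(ops)))
        # Per-(operator, regime) affine adapters A_{m,j,r}(u) = gamma * u + beta
        self.gamma = nn.Parameter(torch.ones(len(ops), num_regimes))
        self.beta = nn.Parameter(torch.zeros(len(ops), num_regimes))

    def forward(self, h, z_t, tau=1.0):
        # z_t: regime simplex from the gate g_psi, shape (num_regimes,)
        z_m = F.gumbel_softmax(self.logits, tau=tau)          # operator weights
        out = 0.0
        for j, op in enumerate(self.ops):
            u = op(h)
            # regime mixture of affine adapters applied to the operator output
            adapted = sum(z_t[r] * (self.gamma[j, r] * u + self.beta[j, r])
                          for r in range(z_t.shape[0]))
            out = out + z_m[j] * adapted
        return out
```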

4.3. Client-Side Optimization with Decoupled Privacy

On round k, client $i \in S_k$ receives $(\theta^k, \alpha^k, \tau_k)$ and performs E local steps on minibatches $B \subset \mathcal{D}_i$:
$$\theta \leftarrow \theta - \eta_w\, \nabla_\theta L_i(\theta, \alpha; B), \qquad \alpha_{m,j} \leftarrow \frac{\alpha_{m,j}\, \exp\!\big(-\eta_\alpha\, \widehat{\nabla_{\alpha_{m,j}} L_i}\big)}{\sum_{\ell=1}^{J_m} \alpha_{m,\ell}\, \exp\!\big(-\eta_\alpha\, \widehat{\nabla_{\alpha_{m,\ell}} L_i}\big)} \quad (\text{mirror descent; cf. (4)}).$$
After local optimization, the client forms deltas $\Delta\theta_i = \theta - \theta^k$ and $\Delta\alpha_i = \alpha - \alpha^k$. We apply separate clipping and Gaussian mechanisms to weights and architecture channels:
$$\hat{\Delta\theta}_i = \mathrm{clip}(\Delta\theta_i, C_w) + \mathcal{N}\big(0, \sigma_w^2 C_w^2 I\big), \qquad \hat{\Delta\alpha}_i = \mathrm{clip}(\Delta\alpha_i, C_\alpha) + \mathcal{N}\big(0, \sigma_\alpha^2 C_\alpha^2 I\big).$$
This ADDP decoupling allows ( ε w , δ w ) and ( ε α , δ α ) to be tuned independently via RDP/moments accounting. In practice we allocate more privacy (larger σ α and/or fewer uploads) to the more sensitive architecture channel.
Weights $\hat{\Delta\theta}_i$ are uploaded every round, and architectures $\hat{\Delta\alpha}_i$ only when $k \bmod H_\alpha = 0$. This asymmetric schedule matches the bilevel reduction, reduces bandwidth, and limits cumulative $\varepsilon_\alpha$. Low-rank sketches can be applied locally to $\Delta\theta_i$ before sanitization to reduce payload without changing the privacy analysis. Clipping introduces bounded bias that is traded against variance from Gaussian noise; thresholds $C_w$ and $C_\alpha$ are tuned to keep clipping rates modest. To mitigate variance amplification at low $\tau$, we use gradient normalization within RG-MixOps and damp $\eta_\alpha$ relative to $\eta_w$. Typical robust ranges are observed empirically: $C_w \in [0.5, 3]$, $C_\alpha \in [0.1, 1]$, $\sigma_w \in [0.5, 2]$, $\sigma_\alpha \in [1, 4]$, $H_\alpha \in \{2, 4, 8\}$.
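The exponentiated-gradient form of the architecture update can be sketched in a few lines of NumPy, treating the per-decision architecture vector as a probability distribution; the step size used in the example is illustrative.

```python
import numpy as np

def mirror_descent_step(alpha_m, grad_m, eta_alpha=0.05):
    """Exponentiated-gradient (KL mirror-descent) step on the simplex for one decision.
    alpha_m: current operator probabilities; grad_m: loss gradient w.r.t. them."""
    updated = alpha_m * np.exp(-eta_alpha * grad_m)
    return updated / updated.sum()

# Example: three candidate operators, gradient favouring the first operator
alpha_m = np.array([1 / 3, 1 / 3, 1 / 3])
grad_m = np.array([-1.0, 0.2, 0.5])
print(mirror_descent_step(alpha_m, grad_m))   # mass shifts toward operator 0
```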

4.4. Server Aggregation and Search-to-Train Curriculum

4.4.1. Trust/Staleness Weighting and Weight Aggregation

Upon receiving sanitized updates, the server computes similarity-aware weights
$$\omega_i^k \propto \nu_i\, \phi(\tau_i^k)\, \rho_i^k, \qquad \rho_i^k = \frac{\exp\!\big(\lambda_{\mathrm{trust}} \cdot \cos(\hat{\Delta\theta}_i, \overline{\Delta\theta})\big)}{\sum_{j \in S_k} \exp\!\big(\lambda_{\mathrm{trust}} \cdot \cos(\hat{\Delta\theta}_j, \overline{\Delta\theta})\big)}, \qquad \overline{\Delta\theta} = \sum_{j \in S_k} \nu_j\, \hat{\Delta\theta}_j,$$
where cosine similarity is computed on low-rank sketches to reduce bandwidth and noise variance. The global weights update is
$$\theta^{k+1} = \theta^{k} + \sum_{i \in S_k} \omega_i^k\, \hat{\Delta\theta}_i.$$
Architecture parameters are aggregated with the KL barycenter (cf. (5)), which yields the normalized geometric mean per decision m:
$$\alpha^{k+1}_m[j] = \frac{\prod_{i \in S_k} \big(\alpha_{i,m}[j]\big)^{\omega_i^k}}{\sum_{\ell=1}^{J_m} \prod_{i \in S_k} \big(\alpha_{i,m}[\ell]\big)^{\omega_i^k}}, \qquad j = 1, \dots, J_m,$$
followed by smoothing
$$\tilde{\alpha}^{k+1} = (1 - \lambda_{\mathrm{bar}})\, \alpha^{k} + \lambda_{\mathrm{bar}}\, \alpha^{k+1}, \qquad \lambda_{\mathrm{bar}} \in (0, 1].$$
If no new $\alpha$ is received (rounds with $k \bmod H_\alpha \neq 0$), the server keeps $\alpha^{k+1} = \alpha^{k}$ and continues temperature annealing independently.
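A compact sketch of the similarity- and staleness-aware weighting is shown below, assuming client deltas are flattened NumPy vectors and using phi(tau) = 1/(1 + tau); the helper name and the lam_trust value are illustrative.

```python
import numpy as np

def trust_staleness_weights(deltas, nu, staleness, lam_trust=1.0):
    """Combine sample weights nu_i, staleness penalties phi(tau) = 1/(1 + tau), and
    cosine-similarity trust scores into normalized aggregation weights."""
    phi = 1.0 / (1.0 + np.asarray(staleness, dtype=float))
    ref = sum(n * d for n, d in zip(nu, deltas))              # cohort direction
    cos = np.array([d @ ref / (np.linalg.norm(d) * np.linalg.norm(ref) + 1e-12)
                    for d in deltas])
    rho = np.exp(lam_trust * cos)
    rho /= rho.sum()
    w = np.asarray(nu) * phi * rho
    return w / w.sum()
```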

4.4.2. Three-Phase Search-to-Train Curriculum

We schedule the following:
Phase I (warm search) with Gumbel–Softmax temperature annealing
$$\tau_{k+1} = \max(\tau_{\min},\, \eta_\tau\, \tau_k), \qquad \eta_\tau < 1,$$
weight uploads every round, and architecture uploads every $H_\alpha$ rounds;
Phase II (selection and sparse retraining), where we discretize $\alpha$ by top-1 or top-s per decision using frequency statistics of $z_m$ and then freeze $\alpha$ while continuing federated optimization of $\theta$ (often with larger E and smaller $\eta_w$);
Phase III (market-aware personalization), where each client deploys private lightweight parameters $\eta^{(i)}$ (e.g., per-regime adapters A or a small head), forming $\theta^{(i)} = \theta \oplus \eta^{(i)}$ and optimizing
$$\min_{\eta^{(i)}} \; \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\Big[L_i\big(\theta \oplus \eta^{(i)},\, \alpha_{\mathrm{disc}}\big)\Big] + \mu\, \big\|\eta^{(i)}\big\|_2^2 + \lambda_{\mathrm{prox}}\, \big\|\theta - \theta_{\mathrm{ref}}\big\|_2^2,$$
with η ( i ) remaining local (no uplink). Phase II halts the growth of the architecture-channel privacy ledger; Phase III typically requires no further communication.
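The scheduling logic of the curriculum can be summarized by two small helpers, sketched below under the assumption of a geometric annealing factor and a fixed Phase II start round (the specific constants are illustrative).

```python
def anneal_temperature(tau, eta_tau=0.97, tau_min=0.1):
    """Phase I geometric temperature annealing: tau_{k+1} = max(tau_min, eta_tau * tau_k)."""
    return max(tau_min, eta_tau * tau)

def upload_architecture(round_k, H_alpha=4, phase_two_start=300):
    """Lazy architecture synchronization: upload alpha every H_alpha rounds during search,
    and stop once Phase II freezes the architecture (illustrative round threshold)."""
    return round_k < phase_two_start and round_k % H_alpha == 0
```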

5. Theoretical Analysis

This section establishes the theoretical underpinnings of FedRegNAS, providing convergence guarantees under bounded staleness and differential privacy, as well as a rigorous characterization of its user-level privacy accounting. We proceed by first presenting the assumptions that govern the analysis, then deriving the main convergence theorem, followed by detailed discussions on the effects of clipping, trust weighting, and privacy composition.

5.1. Assumptions and Preliminaries

The following assumptions extend those stated in Section 3, formalizing the analytical setting under which FedRegNAS operates.
(A1)
Smoothness. For each client i, the local loss function L i ( θ , α ) is continuously differentiable, and its gradient is L-Lipschitz in both θ and α , i.e.,
$$\big\|\nabla L_i(\theta_1, \alpha_1) - \nabla L_i(\theta_2, \alpha_2)\big\|_2 \le L\, \big\|(\theta_1 - \theta_2,\; \alpha_1 - \alpha_2)\big\|_2.$$
(A2)
Bounded gradient variance. The stochastic gradients are unbiased and have bounded variance:
$$\mathbb{E}\big[\widehat{\nabla} L_i(\theta, \alpha; B)\big] = \nabla L_i(\theta, \alpha), \qquad \mathbb{E}\,\big\|\widehat{\nabla} L_i - \nabla L_i\big\|_2^2 \le \sigma^2.$$
(A3)
Bounded delay (staleness). The communication latency $\tau_i^k$ associated with client i at round k is bounded, with expected value $\bar{\tau} = \mathbb{E}[\tau_i^k]$ and a corresponding attenuation factor $\phi(\tau_i^k) \in (0, 1]$.
(A4)
Clipping and differential-privacy noise. Each client clips its model and architecture deltas using thresholds ( C w , C α ) and adds independent Gaussian noise with multipliers ( σ w , σ α ) , ensuring bounded sensitivity and user-level privacy.
Under these assumptions, the aggregated optimization objective is the smoothed function
$$F_\lambda(\theta, \alpha) = F(\theta, \alpha) + \lambda_{\mathrm{res}} \cdot \mathcal{R}(\theta, \alpha),$$
where $\mathcal{R}(\cdot)$ denotes the resource regularization (e.g., a FLOPs or latency penalty), and $\lambda_{\mathrm{res}} \ge 0$ is its associated coefficient.

5.2. Main Convergence Result

Let K be the total number of rounds, q the probability of client participation per round, and E the number of local gradient steps per client. Denote by $\eta_w$ and $\eta_\alpha$ the respective learning rates for weights and architectures, and by $K_\alpha \le K$ the number of rounds in which architecture updates are transmitted. Then, the following result characterizes the expected convergence rate of FedRegNAS.
Theorem 1. 
Under Assumptions (A1)–(A4), and with learning rates $\eta_w = \Theta\big((KqE)^{-1/2}\big)$ and $\eta_\alpha = \Theta\big((KqE)^{-1/2}\big)$, the expected first-order stationarity gap of $F_\lambda$ satisfies
$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\,\big\|\nabla F_\lambda(\theta^k, \alpha^k)\big\|_2^2 = \mathcal{O}\!\left(\frac{1 + \bar{\tau}}{\sqrt{KqE}}\right) + \mathcal{O}\!\left(\sigma_w^2 C_w^2 + \frac{K_\alpha}{K}\, \sigma_\alpha^2 C_\alpha^2\right) + \mathcal{O}\big(\lambda_{\mathrm{res}}\big).$$
Furthermore, the multiplicative constant in the first term improves monotonically with the trust parameter λ trust up to a regime-dependent threshold beyond which over-concentration degrades performance.

Proof Sketch

The proof follows from standard stochastic nonconvex analysis adapted to mirror descent on the product of Euclidean and simplex spaces. Specifically,
  • The expected decrease in F λ per round is bounded by the inner product of gradients and updates, adjusted for staleness.
  • Gaussian perturbations are zero-mean, and clipping bias is bounded by $\mathbb{E}\big[(\|\Delta_c\|_2 - C_c)_+\big]$ for each channel $c \in \{w, \alpha\}$.
  • The mirror map’s strong convexity ensures a Bregman-descent inequality whose Euclidean norm equivalent differs by at most a constant factor.
  • The cosine-similarity reweighting introduces a bounded variance reduction proportional to exp ( λ trust ) up to saturation.
Combining these arguments and taking expectations over client sampling and DP noise yields the claimed result.

5.3. Impact of Clipping and Trust Weighting

5.3.1. Clipping Bias

For any clipped update Δ c with threshold C c , we can express the bias as
$$b_c = \Delta_c - \mathrm{clip}(\Delta_c, C_c), \qquad \|b_c\|_2 \le \mathbb{E}\big[(\|\Delta_c\|_2 - C_c)_+\big].$$
As long as C c exceeds the 95th percentile of the empirical update norm, clipping bias contributes only a small additive constant to the convergence bound. Empirically, maintaining clipping ratios below 15 % yields stable optimization without excessive noise amplification.

5.3.2. Trust Weighting Effects

The trust weighting function in Equation (14) improves robustness by aligning updates with the cohort direction $\overline{\Delta\theta}$. This mechanism effectively acts as a preconditioner that reduces inter-client gradient variance, improving convergence constants without changing the asymptotic rate. However, overly large $\lambda_{\mathrm{trust}}$ can lead to overconfidence, amplifying bias when gradients are diverse. Empirically, $\lambda_{\mathrm{trust}} \in [0.5, 2]$ achieves optimal trade-offs.

5.4. Differential-Privacy Accounting

Each round of FedRegNAS applies an independent Gaussian mechanism to the clipped deltas of both weight and architecture channels, combined with Poisson subsampling over clients. We quantify privacy loss using the Rényi differential privacy (RDP) framework.

5.4.1. Per-Round Mechanism

For channel $c \in \{w, \alpha\}$, define the Gaussian mechanism
$$\mathcal{M}_c(\Delta_c) = \mathrm{clip}(\Delta_c, C_c) + \mathcal{N}\big(0, \sigma_c^2 C_c^2 I\big),$$
applied to a subsampled set of clients with probability q. The resulting per-round RDP cost of order $\lambda > 1$ is
$$\rho_c(\lambda; q, \sigma_c) = \frac{1}{\lambda - 1} \log\!\Big(1 + q^2\big(e^{(\lambda - 1)/\sigma_c^2} - 1\big)\Big).$$

5.4.2. Composition and Conversion

Over $K_c$ active rounds ($K_w = K$ and $K_\alpha \le K$), the cumulative RDP cost is additive:
$$\varepsilon_c^{\mathrm{RDP}}(\lambda) = K_c\, \rho_c(\lambda; q, \sigma_c).$$
Conversion to standard $(\varepsilon_c, \delta_c)$-DP follows from
$$\varepsilon_c(\delta_c) = \min_{\lambda > 1} \left[ \varepsilon_c^{\mathrm{RDP}}(\lambda) + \frac{\log(1/\delta_c)}{\lambda - 1} \right].$$
The total privacy guarantee can be reported either as separate channels ( ε w , δ w ) and ( ε α , δ α ) , or as an aggregate ( ε , δ ) via composition.
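The accounting above can be sketched directly, using the per-round bound stated in Section 5.4.1 and a grid of RDP orders; because that bound is a simplification, the resulting epsilon values may be looser than those obtained with a tighter subsampled-Gaussian accountant, and the example noise multipliers and round counts are illustrative.

```python
import numpy as np

def rdp_per_round(lam, q, sigma):
    """Per-round RDP cost of the subsampled Gaussian mechanism, using the bound above:
    rho = log(1 + q^2 * (exp((lam - 1) / sigma^2) - 1)) / (lam - 1)."""
    return np.log1p(q ** 2 * (np.exp((lam - 1.0) / sigma ** 2) - 1.0)) / (lam - 1.0)

def epsilon_from_rdp(K_c, q, sigma, delta, orders=np.arange(1.25, 64.0, 0.25)):
    """Convert K_c-fold composition to (epsilon, delta)-DP by minimizing over RDP orders."""
    eps = K_c * rdp_per_round(orders, q, sigma) + np.log(1.0 / delta) / (orders - 1.0)
    return eps.min()

# Example: weight channel over K = 300 rounds vs. architecture channel over K_alpha = 60 rounds
print(epsilon_from_rdp(300, q=0.3, sigma=0.75, delta=1e-5))
print(epsilon_from_rdp(60,  q=0.3, sigma=1.2,  delta=1e-5))
```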

5.4.3. Privacy–Utility Allocation

Because architecture parameters $\alpha$ are typically more sensitive—encoding structural and potentially proprietary information—FedRegNAS allocates stronger protection (a tighter privacy budget) to this channel. Practically, this is achieved by using higher noise multipliers $\sigma_\alpha > \sigma_w$ and reducing the number of architecture uploads ($K_\alpha < K$) through the curriculum schedule. Typical choices satisfying moderate privacy budgets are
$$\sigma_w \in [0.5, 2], \qquad \sigma_\alpha \in [1, 4], \qquad C_w \in [0.5, 3], \qquad C_\alpha \in [0.1, 1].$$

5.4.4. Amplification by Subsampling and Curriculum

Client-level subsampling ($q < 1$) amplifies privacy, as only a subset of clients participates in each round. Furthermore, the search-to-train curriculum (cf. Section 4) reduces the number of architecture transmissions after Phase II, yielding $K_\alpha \ll K$ and consequently much tighter bounds on $\varepsilon_\alpha$.

5.5. Discussion and Implications

The derived convergence rate demonstrates that FedRegNAS maintains the same asymptotic order $\mathcal{O}\big((1 + \bar{\tau})/\sqrt{KqE}\big)$ as standard differentially private federated optimization, while incorporating two additional sources of robustness:
  • Trust weighting: Reduces gradient variance and accelerates early-phase convergence under heterogeneous data distributions.
  • Privacy decoupling: Allows independent control over the privacy budgets of weights and architectures, improving model utility without violating DP constraints.
The rate constants are influenced by staleness ( τ ¯ ) and the DP noise scales ( σ w , σ α ); both can be mitigated through appropriate scheduling and noise calibration. In summary, the analysis confirms that FedRegNAS achieves provably stable convergence under realistic federated settings with heterogeneous clients, bounded staleness, and rigorous differential privacy guarantees.

Summary

The theoretical analysis establishes that FedRegNAS converges to a stationary point of the global objective at the rate
$$\mathcal{O}\!\left(\frac{1 + \bar{\tau}}{\sqrt{KqE}}\right) + \mathcal{O}\!\left(\sigma_w^2 C_w^2 + \frac{K_\alpha}{K}\, \sigma_\alpha^2 C_\alpha^2\right),$$
while satisfying user-level differential privacy under the Gaussian mechanism with per-channel accounting. These results validate the method’s design choices—namely, trust-weighted aggregation, ADDP privacy decoupling, and the search-to-train curriculum—from a theoretical standpoint.

6. Experiments

We evaluate FedRegNAS on federated equity forecasting tasks covering daily and intraday horizons. Experiments assess accuracy, trading relevance, communication efficiency, privacy, and the contribution of each algorithmic component. All notation follows Section 3 and Section 4.

6.1. Setup

6.1.1. Problem Setting

We evaluate FedRegNAS on three federated equity forecasting benchmarks that reflect distinct temporal granularities and cross-market heterogeneity. Each client corresponds to an institution-like shard composed of disjoint ticker subsets and time ranges. Clients never share raw data; only sanitized updates (weights and, intermittently, architectures) are transmitted. Targets are next-period log-returns $y_t$, and inputs comprise price- and volume-derived features, technical indicators, and normalized microstructure signals. All preprocessing is performed locally to preclude information leakage. Datasets and client partitions are given in Table 1. Data availability is stated at the end of this paper.

6.1.2. Data Construction and Preprocessing

For each dataset, we apply the following: (i) session alignment with exchange calendars, (ii) removal of auctions, halts, and limit-up/down prints, (iii) forward-fill of sparse features with mask channels, (iv) log-differencing and z-score normalization per feature using rolling statistics computed strictly within the training horizon, and (v) leakage guards by disallowing same-day lookahead across corporate actions and delayed prints. Missing data are imputed locally with confidence masks; the masks are part of the model inputs. Feature windows of length L are formed with stride 1.

6.1.3. Evaluation Protocol

We adopt a chronological split: train (2012–2024 daily; 2019–2024 intraday), validation (final 10% of the train horizon), and test (calendar year 2025). Metrics are root mean squared error (RMSE), directional accuracy (DA), and annualized Sharpe ratio (SR) for a thresholded long-short policy $\pi_\vartheta$ with symmetric threshold $\kappa$ selected on validation. We report averages across clients with equal weights unless otherwise stated. Aggregating the per-dataset improvements discussed in the following sections (detailed results appear in the tables below), FedRegNAS achieves gains in directional accuracy of 3–7 percentage points and Sharpe-ratio improvements of 18–32% over the best federated baselines.
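For reproducibility of the evaluation metrics, the sketch below shows one plausible NumPy implementation of directional accuracy, the symmetric threshold policy, and the annualized Sharpe ratio; the 252-period annualization and zero risk-free rate are assumptions appropriate for daily data.

```python
import numpy as np

def directional_accuracy(y, y_hat):
    """Fraction of periods where the predicted sign matches the realized sign."""
    return np.mean(np.sign(y) == np.sign(y_hat))

def threshold_policy(y_hat, kappa):
    """Symmetric long/short/flat policy: +1 if y_hat > kappa, -1 if y_hat < -kappa, else 0."""
    return np.where(y_hat > kappa, 1.0, np.where(y_hat < -kappa, -1.0, 0.0))

def annualized_sharpe(strategy_returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period strategy returns (zero risk-free rate assumed)."""
    mu, sd = strategy_returns.mean(), strategy_returns.std(ddof=1)
    return np.sqrt(periods_per_year) * mu / (sd + 1e-12)
```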

6.1.4. Federated Learning Setting

Each round samples a subset $S_k$ of clients with rate $q = |S_k|/N = 0.3$. Local optimization executes E steps per client on minibatches of size 128. Architectural messages are uploaded every $H_\alpha$ rounds; weight messages are uploaded every round, clipped, noised, and compressed.

6.1.5. Regime Modeling and Search Space

The regime encoder $g_\psi(\cdot)$ outputs $z_t \in \Delta^{R}$ with R = 3 regimes by default. The supernet includes per-decision candidate operators: {1D depthwise separable conv, gated temporal conv, GRU cell, lightweight self-attention, identity}. Mixed operations are gated by $\alpha$ (Gumbel–Softmax) and modulated by $z_t$ through per-regime affine adapters. The details of the optimization, privacy, and communication settings are presented in Table 2, while Figure 2 illustrates the chronological data splits for each benchmark with non-overlapping labels.

6.2. Main Results

6.2.1. Overview

This subsection reports forecasting and trading performance across the three benchmarks, together with statistical confidence and method-wise improvements. All results are computed on strictly held-out 2025 test sets with model selection on validation. We report averages across clients with equal weights; for dispersion we compute client-bootstrapped 95 % confidence intervals (CIs) using 1000 resamples. To ensure comparability, thresholds for the trading policy π ϑ are selected on validation per method and then frozen on test.

6.2.2. Accuracy and Trading Quality

Table 3 summarizes RMSE, DA, and SR. FedRegNAS attains the lowest RMSE and highest DA and Sharpe on all three datasets. Relative to the strongest federated baseline (FedProx-Trans), RMSE declines by 6–11%, DA increases by 2.2–3.1 points, and SR improves by 12–28%. The centralized upper bound (Centralized-DARTS) remains competitive but is consistently matched or exceeded by FedRegNAS in SR despite lacking privacy and communication constraints.

6.2.3. Statistical Significance

We conduct paired tests per client: (i) a one-sided Diebold–Mariano test for squared-error loss (RMSE proxy), and (ii) a stratified permutation test for DA and SR with 10,000 label-preserving shuffles within asset-day blocks. Table 4 reports improvement deltas and the fraction of clients with p < 0.05. Improvements are broadly significant, particularly on Minute-US, where regime gating captures intraday nonstationarity. The forecasting accuracy (RMSE) and relative improvement are presented in Table 5.

6.2.4. Robustness Across Universes

Table 3 visualizes aggregate performance. To avoid legend clutter, color/shade semantics are encoded in the captions. Across datasets, the relative ordering of methods is stable; dispersion (CI whiskers) is narrowest for FedRegNAS, indicating reduced cross-client variance attributable to trust-weighted aggregation and KL-barycentric architecture pooling.
In general, we find that (1) regime gating and KL-barycentric aggregation yield consistent gains across horizons and geographies; (2) improvements materialize not only in RMSE but also in trading metrics, indicating better calibration around decision thresholds; and (3) reduced dispersion across clients suggests enhanced robustness to non-IID data induced by institutional partitioning.

6.3. Ablation and Component Analysis

In this subsection we present ablations and diagnostic experiments addressing the impact of (i) decoupled DP noise scales ( σ w , σ α ) , (ii) the regime cardinality R, (iii) KL-barycentric aggregation versus simple averaging under non-IID clients, (iv) the calibration of latency and communication proxies, (v) volatility-stratified trading performance for economic interpretability, and (vi) the marginal contributions of core algorithmic components (regime gating, KL aggregation, trust weighting, and decoupled DP). Unless otherwise specified, ablations are conducted on the Daily-US benchmark, which exhibits pronounced market-regime variability and moderate client heterogeneity.

6.3.1. Sensitivity to Decoupled DP Noise Scales

We first study the sensitivity of FedRegNAS to the decoupled DP noise scales $(\sigma_w, \sigma_\alpha)$ on Daily-US. We sweep $\sigma_w \in \{0.2, 0.4, 0.6, 0.8\}$ and $\sigma_\alpha \in \{0.2, 0.4, 0.6, 0.8\}$ while keeping the target user-level privacy budget fixed by adjusting the number of composition steps accordingly. For each configuration we report RMSE, DA, Sharpe ratio, and a leakage proxy defined as the average norm of architecture gradients before clipping. Figure 3 shows RMSE as a function of $\sigma_\alpha$ for different $\sigma_w$; performance is comparatively stable when varying $\sigma_w$ for fixed $\sigma_\alpha$, whereas increasing $\sigma_\alpha$ beyond 0.4 leads to noticeable degradation in RMSE. As shown in Figure A1 (Appendix B), when $\sigma_\alpha$ is fixed at 0.4, the effect of increasing $\sigma_w$ on RMSE and the leakage proxy is minimal, validating that model performance is insensitive to weight noise. The leakage proxy (not shown for brevity) is also substantially more sensitive to $\sigma_\alpha$ than to $\sigma_w$, supporting our claim that architecture-related gradients are the dominant leakage channel under our setting. As shown in Figure A2 (Appendix B), when $\sigma_w$ is fixed at 0.4, increasing $\sigma_\alpha$ leads to a significant increase in RMSE and a sharp rise in the leakage proxy, confirming that the architecture gradient is the main leakage channel.

6.3.2. Effect of Regime Cardinality R

We next investigate how the number of latent regimes R in the encoder $g_\psi$ affects performance. We train FedRegNAS with $R \in \{2, 3, 4, 6, 8, 10\}$ on Daily-US and Minute-US under the same training budget and DP configuration. Table 6 summarizes RMSE, DA, and Sharpe ratio, averaged over three runs (in Table 6, RMSE is computed on the de-normalized prediction target, i.e., the original log-return or percentage scale, and is therefore numerically larger than the standardized RMSEs reported in Table 3 and related results). Performance is relatively poor for very small R (e.g., 2) and slightly unstable for very large R (e.g., 10), while the range $R \in \{4, 6\}$ yields the best trade-off across both datasets. This indicates that the regime encoder is reasonably robust to the precise choice of R and transfers across markets without extensive retuning.

6.3.3. KL-Barycentric Aggregation vs. Simple Averaging

To understand the empirical effect of KL-barycentric aggregation, we compare FedRegNAS with a variant that uses simple averaging of architecture logits across clients while keeping all other components unchanged. We evaluate both variants on deliberately non-IID splits of Daily-US, Minute-US, and Daily-Global, where client distributions differ strongly in volatility and sector composition. Table 7 reports RMSE, DA, and the average Jensen–Shannon (JS) divergence between client-wise architecture distributions and the aggregated distribution at convergence. KL-barycentric aggregation consistently improves RMSE and DA and substantially reduces the dispersion of architecture distributions across all three datasets, indicating more coherent architectures under heterogeneous clients.

6.3.4. Calibration of Latency and Communication Proxies

We calibrate our differentiable latency and communication proxies against wall-clock measurements on a heterogeneous cluster containing both GPU and CPU clients and different network conditions. For a set of architectures sampled during search, we record proxy values and actual inference latency and transmitted bytes per round under three representative profiles: (i) GPU + high-bandwidth LAN, (ii) CPU + LAN, and (iii) CPU + constrained wireless (4G). Table 8 reports Pearson correlation and mean absolute percentage error (MAPE) between proxies and measured quantities. Correlations exceed 0.9 for all profiles, with MAPE below 10 % , indicating that the proxies are well aligned with actual hardware behavior and suitable as surrogates in the multi-objective search.

6.3.5. Volatility-Stratified Trading Performance

Finally, to enhance economic interpretability, we stratify test periods into high- and low-volatility regimes based on a rolling realized-volatility threshold (top and bottom terciles) and compare FedRegNAS with two baselines (Local-GRU and FedAvg-LSTM) on Daily-US. Table 9 reports DA and Sharpe ratio for both volatility regimes. FedRegNAS improves DA and Sharpe in both regimes, with particularly pronounced gains during high-volatility periods where the baselines suffer larger drawdowns. This suggests that the regime-gated architecture and federated NAS design not only improve aggregate metrics but also yield more robust trading behavior under extreme market conditions.

6.3.6. Component-Level Ablation on Daily-US

We now evaluate the marginal contributions of the primary algorithmic components within FedRegNAS—namely regime gating, KL-barycentric aggregation, trust weighting, and architecture/weight differential-privacy decoupling (ADDP). Each ablation removes or substitutes a specific element while keeping the same search space, supernet initialization, privacy parameters, and communication budgets. We define five configurations: (i) full model (FedRegNAS complete), (ii) w/o regime gating (replacing regime encoder g ψ with a constant gate), (iii) mean aggregation (Euclidean average of architecture parameters in place of the KL barycenter), (iv) coupled-DP (single shared noise budget for both channels), and (v) no trust weighting (uniform aggregation weights). All models use K = 300 communication rounds, E = 5 local epochs, and H α = 5 , with the same noise multipliers ( σ w , σ α ) = ( 0.75 , 1.2 ) . We evaluate not only predictive accuracy but also communication cost and convergence stability (variance of validation-loss trajectories).
The results in Table 10 indicate that removing any core mechanism leads to measurable degradation. The largest performance drops arise from disabling regime gating and trust weighting, consistent with their roles in addressing nonstationarity and heterogeneity. Using Euclidean aggregation instead of KL barycenters modestly increases RMSE and destabilizes architecture probabilities (higher entropy drift). Coupling DP noise budgets reduces effective utility by over-noising the weight channel. Interestingly, all ablations retain communication efficiency, confirming that improvements stem from algorithmic, not infrastructural, changes.
To better understand component effects, Figure 4 and Figure 5 present convergence and weight-entropy diagnostics. The former plots smoothed validation-loss trajectories across rounds; the latter displays the evolution of trust-weight entropy (a proxy for aggregation concentration). FedRegNAS demonstrates faster and smoother convergence, with significantly lower entropy, reflecting confident and consistent weighting of high-quality clients. Regime gating yields reduced oscillations post-round 150, suggesting effective adaptation to regime transitions, while KL-barycentric aggregation maintains stable architectural probabilities and avoids premature collapse to suboptimal operators.
Overall, the ablation outcomes reinforce the necessity of coupling regime-awareness, information-geometric aggregation, trust-aware weighting, and decoupled differential privacy within a unified optimization pipeline. Removing any mechanism impairs either convergence stability, forecast accuracy, or variance control. Regime gating delivers the most substantial gains, improving both RMSE and DA by capturing temporal heterogeneity. KL-barycentric aggregation contributes to architecture stability and inter-client consistency, while trust weighting accelerates convergence under DP noise. Collectively, these elements ensure FedRegNAS’s superior robustness and efficiency under federated nonstationary conditions.

6.4. Efficiency and Privacy

6.4.1. Objectives and Methodology

This subsection quantifies the communication footprint, on-device latency, and user-level privacy of FedRegNAS relative to strong baselines. We report the following: (i) mean client uplink size per round after clipping, quantization, and sparsification; (ii) median inference latency on a mobile-class CPU for a single forward pass; and (iii) differential-privacy budgets accumulated via Rényi accounting under Poisson subsampling. To isolate design contributions, we additionally decompose the uplink into constituent payloads and vary the architecture upload cadence H α .

6.4.2. Key Observations

First, the search-to-train curriculum coupled with lazy architecture synchronization (uploads every $H_\alpha$ rounds) reduces uplink by 43% versus FedAvg-LSTM and 70% versus FedProx-Trans on Daily-US, while also lowering on-device latency through regime-gated sparsity. Second, ADDP (decoupled privacy) concentrates noise where it is most sensitive (the architecture channel), enabling lower weight noise without violating the global budget. Third, increasing $H_\alpha$ tightens the architecture-channel privacy $\varepsilon_\alpha$ nearly linearly in the fraction of active rounds $K_\alpha / K$, with minimal effect on accuracy when $H_\alpha \in [4, 8]$.
The communication and latency across datasets are presented in Table 11, while the uplink payload decomposition for Daily-US is shown in Table 12. Additionally, Figure 6 presents the mean client uplink per round on Daily-US. Figure 7 illustrates the privacy budget trajectories, where the solid curve represents ϵ w (weights) under ADDP, the dashed curve corresponds to ϵ α (architecture) under ADDP with uploads every H α = 5 until Phase II at round 300, and the dotted/dash-dotted curves show the coupled-DP baseline, where both channels accrue the same budget more rapidly. No legend is included in the figure, as the curve semantics are specified here.
In general, we find that (1) communication savings stem from reduced architecture traffic and compressed weight updates; the latter contribute 76 % of the uplink, while architecture packets account for 19 % under H α = 5 ; (2) the architecture-channel privacy improves nearly linearly with H α , with negligible loss in RMSE up to H α = 8 ; and (3) latency reductions reflect both the smaller selected subgraph post-selection (Phase II) and regime-conditioned adapters that avoid redundant computation during inference.

6.5. Practical Evaluation and Benchmarks

This subsection complements the main experiments by (i) comparing FedRegNAS against strong centralized sequence models, (ii) extending the ablation study to high-frequency data, (iii) quantifying the computational overhead of the NAS search phase and its amortization, and (iv) presenting a simulated institutional case study that incorporates constraints common in brokerage/asset-management settings.

6.5.1. Centralized Sequence-Model Baselines

To contextualize FedRegNAS from a pure forecasting perspective, we benchmark several strong sequence models that are widely used in financial time-series forecasting when centralizing all data is allowed. Specifically, we implement centralized versions of Temporal Fusion Transformers (TFTs), N-BEATS, and DeepAR, and compare them with a centralized variant of our search-discovered FedRegNAS architecture (FedRegNAS-C) on Daily-US and Minute-US. All models are trained on pooled data without any FL or DP constraints, using the same train/validation/test splits and evaluation metrics (RMSE, DA, and Sharpe ratio). Table 13 reports the results. As expected, the centralized baselines define strong upper bounds, particularly on RMSE; however, FedRegNAS-C remains competitive with TFT and N-BEATS across both datasets, while the federated FedRegNAS model trails only slightly despite operating under user-level DP and client-side data locality. These comparisons suggest that FedRegNAS achieves forecasting quality close to state-of-the-art centralized models, while additionally offering the benefits of FL and DP.

6.5.2. Regime Gating and KL Barycenter on High-Frequency Data

The ablation in Table 6 focuses on Daily-US. To test the robustness of the architectural design under high-frequency and nonstationary conditions, we replicate a subset of the ablations on Minute-US, specifically isolating the contributions of regime gating and KL-barycentric aggregation. Table 14 compares four variants: (i) a baseline without regime gating or KL aggregation (standard FedAvg over architectures), (ii) only regime gating, (iii) only KL aggregation, and (iv) the full FedRegNAS. Regime gating and KL-barycentric aggregation each improve performance when added individually, and combining them yields the best RMSE and DA, confirming that the qualitative conclusions from Daily-US carry over to the high-frequency Minute-US setting.

6.5.3. Search-Phase Overhead and Amortization

Because FedRegNAS includes an NAS search phase, decision-makers may be concerned about additional computational overhead relative to directly training a fixed architecture such as FedAvg-LSTM or FedProx-Trans. Table 15 reports total training time and approximate GPU-hours for (i) FedAvg-LSTM, (ii) FedProx-Trans, (iii) FedRegNAS including the full search-to-train curriculum (search + selection + personalization), and (iv) FedRegNAS when reusing a previously discovered architecture for a new market (fine-tuning only). Results are measured on the same cluster with 4 × V100 GPUs and 16 CPU cores. While the initial FedRegNAS run is roughly 2–3× more expensive than directly training a fixed architecture, reusing the discovered architecture reduces the cost to a level comparable with FedProx-Trans. Over multiple deployments (e.g., across three markets or rolling retrain windows), the amortized per-deployment cost of FedRegNAS approaches that of fixed-architecture methods, while providing better forecasting and stronger FL + DP guarantees.

6.5.4. Simulated Institutional Case Study

We construct a simulated brokerage-style scenario to bridge the gap between our academic partitions and real institutional constraints. In this setting, each client corresponds to a large asset manager trading a subset of U.S. and global equities, subject to (i) regulatory trading windows (orders can only be placed during specified intraday intervals), (ii) latency budgets for model inference (e.g., ≤10 ms per prediction for intraday signals), and (iii) trading restrictions such as daily turnover caps and maximum position sizes. These constraints are enforced when generating trades from model predictions and when measuring latency on a heterogeneous cluster with mixed GPU/CPU clients.
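As an illustration of how such constraints can be enforced when mapping predictions to orders, the sketch below clips per-name positions and rescales trades to respect a daily turnover cap. The thresholds and the uniform-rescaling rule are placeholder choices for exposition, not the exact mechanics of our simulator.

```python
import numpy as np

def enforce_trading_constraints(target_weights: np.ndarray,
                                prev_weights: np.ndarray,
                                max_position: float = 0.05,
                                daily_turnover_cap: float = 0.12) -> np.ndarray:
    """Project raw model-implied portfolio weights onto a simple constraint set.

    1. Clip each position to +/- max_position (fraction of NAV).
    2. If one-way turnover ||w_new - w_prev||_1 exceeds the daily cap,
       scale all trades down uniformly so the cap binds exactly.
    The numeric thresholds here are placeholders, not the study's settings.
    """
    w = np.clip(target_weights, -max_position, max_position)
    trades = w - prev_weights
    turnover = np.abs(trades).sum()
    if turnover > daily_turnover_cap:
        trades *= daily_turnover_cap / turnover
    return prev_weights + trades

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    prev = np.zeros(10)
    raw = 0.2 * rng.standard_normal(10)      # aggressive raw signals
    final = enforce_trading_constraints(raw, prev)
    print(np.abs(final - prev).sum())        # <= 0.12 by construction
```

The "constraint violations before clipping" reported below count how often the raw signals would breach these limits if no such projection were applied.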
Table 16 compares Local-GRU, FedAvg-LSTM, FedProx-Trans, and FedRegNAS under this institutional setup on Minute-US, reporting DA, Sharpe ratio, average daily turnover, and the fraction of trades that would violate turnover/position limits before clipping. FedRegNAS achieves the best DA and Sharpe while maintaining comparable turnover and the lowest violation rate, indicating that the discovered architectures remain effective when realistic trading constraints are imposed.
To assess whether the models meet practical latency requirements, we summarize end-to-end inference latency (including communication and decoding) across clients in Table 17. All methods satisfy the 10 ms budget on average; FedRegNAS remains competitive with FedAvg-LSTM and improves on FedProx-Trans due to the latency-aware objective used during NAS. Together, these results support our conclusion that the relative advantages of FedRegNAS in forecasting and communication persist under a more institutionally realistic regime, while we still caution that unmodeled microstructure effects (e.g., exchange-specific routing latencies) may affect absolute performance in production.

7. Conclusions

We presented FedRegNAS, a regime-aware federated neural architecture search framework for privacy-preserving stock-return forecasting. The framework unifies a regime-gated temporal supernet, mirror-descent search on the simplex, staleness- and trust-aware weight aggregation, KL-barycentric fusion of client architectures, and decoupled differential privacy for weights and architectures (ADDP) under explicit communication and latency proxies. Across daily and intraday federated equity benchmarks, FedRegNAS consistently reduced forecasting error and improved directional accuracy and trading Sharpe ratio, while cutting communication by large margins relative to strong FL baselines and approaching centralized NAS quality without centralizing data. Ablations confirmed the complementary value of regime gating, barycentric aggregation, and DP decoupling, and the search-to-train curriculum yielded deployable, efficient architectures with on-device latency advantages. Limitations include the reliance on a fixed regime cardinality, approximate DP composition, and surrogate resource proxies. Future work will investigate online regime discovery, cross-asset transfer and multi-horizon probabilistic targets, tighter privacy accounting and secure aggregation, and theoretical guarantees for barycentric architecture averaging under subsampled, differentially private updates.
We will release reproducibility materials, including synthetic-data scripts and configuration files, to facilitate adoption and benchmarking. Looking ahead, our framework opens several directions for future work. First, FedRegNAS could be extended to additional financial markets and horizons, including cross-asset and multi-horizon settings, to further test its robustness across regimes and liquidity conditions. Second, richer personalization mechanisms—for example combining regime-aware NAS with hypernetwork-based or representation-level personalization—may yield further gains under extreme client heterogeneity. Finally, applying the core ideas of regime gating, probabilistic architecture aggregation, and decoupled DP to other time-series domains (such as macroeconomic indicators, energy load forecasting, or limit-order-book data) is an interesting avenue for expanding the scope of federated NAS beyond equity returns.

Author Contributions

Conceptualization, Z.C. and H.Z.; methodology, Z.C.; software, Z.C.; validation, Z.C., H.Z. and S.W.; formal analysis, S.W.; investigation, H.Z.; resources, J.C.; data curation, S.W.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and H.Z.; visualization, S.W.; supervision, H.Z.; project administration, J.C.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Details of data availability are provided in Appendix A.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Data Availability

The datasets used in this study can be reproduced from publicly accessible sources. Daily U.S. and global equity bars are available from Stooq’s free market-data archive at https://stooq.com/db/ (see, e.g., the S&P 500 example page at https://stooq.com/q/d/?s=%5Espx) (accessed on 16 June 2025). Minute-level U.S. bars can be obtained via Polygon’s Stocks Aggregates (Bars) API at https://polygon.io/docs/rest/stocks/aggregates/custom-bars (API key required) (accessed on 14 June 2025) or from Kibot’s historical intraday data at https://www.kibot.com/, including free sample files at https://www.kibot.com/free_historical_data.aspx (accessed on 20 July 2025). For researchers with institutional access, the CRSP U.S. Stock database is available through WRDS at https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/center-for-research-in-security-prices-crsp/ (accessed on 20 July 2025).

Appendix B. Sensitivity to DP Noise Scales

We provide full results for the (σ_w, σ_α) sensitivity study in Figure A1 and Figure A2, showing performance and leakage proxies as a function of the two noise scales.
Figure A1 and Figure A2 reveal a clear asymmetry between the impact of the weight-level noise scale σ_w and the architecture-level noise scale σ_α. When σ_w increases with σ_α fixed (Figure A1), both RMSE and the leakage proxy grow only mildly, suggesting that moderate additional noise on model weights can be absorbed by the optimization dynamics without severely harming forecasting performance or substantially changing the gradient statistics. In contrast, when σ_α varies at fixed σ_w (Figure A2), we observe a pronounced degradation in RMSE and a steep rise in the leakage proxy, indicating that the architecture gradients are far more sensitive to DP perturbations and dominate the overall leakage profile. These trends empirically justify our decoupled DP design, in which the architecture channel receives the tighter privacy budget (achieved largely through infrequent uploads every H_α rounds rather than per-round noise alone), while relatively larger noise can be applied to weight updates without significantly compromising performance.
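For concreteness, the sketch below shows the per-channel clip-and-noise step underlying this sensitivity study, using the channel-specific clip norms and noise multipliers of Table 2. It treats each upload as a single Gaussian-mechanism release and omits client subsampling and cross-round composition, which the full ADDP accounting handles; the array sizes are toy placeholders.

```python
import numpy as np

def privatize_update(delta: np.ndarray, clip_norm: float, sigma: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Clip an update to L2 norm clip_norm, then add Gaussian noise with std sigma*clip_norm."""
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, sigma * clip_norm, size=delta.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    weight_delta = rng.normal(size=50_000)   # toy weight update
    arch_grad = rng.normal(size=2_000)       # toy architecture gradient
    # Channel-specific settings from Table 2:
    # (C_w, C_alpha) = (1.0, 0.2), (sigma_w, sigma_alpha) = (0.75, 1.20).
    noisy_w = privatize_update(weight_delta, clip_norm=1.0, sigma=0.75, rng=rng)
    noisy_a = privatize_update(arch_grad, clip_norm=0.2, sigma=1.20, rng=rng)
    print(np.linalg.norm(noisy_w), np.linalg.norm(noisy_a))
```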
Figure A1. Sensitivity of FedRegNAS to the weight-noise scale σ_w (with σ_α fixed to 0.4) on Daily-US. Both RMSE and the leakage proxy increase only mildly as σ_w grows, indicating that performance is relatively insensitive to the weight-level noise scale.
Figure A2. Sensitivity of FedRegNAS to the architecture-noise scale σ_α (with σ_w fixed to 0.4) on Daily-US. Increasing σ_α degrades RMSE and sharply increases the leakage proxy, supporting the claim that architecture gradients are the dominant leakage channel.

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Agüera y Arcas, B. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  2. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. arXiv 2019, arXiv:1912.04977. [Google Scholar]
  3. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  4. Pham, H.; Guan, M.; Zoph, B.; Le, Q.V.; Dean, J. ENAS: Efficient Neural Architecture Search via Parameter Sharing. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4095–4104. [Google Scholar]
  5. Xie, S.; Zheng, H.; Liu, C.; Lin, L. SNAS: Stochastic Neural Architecture Search. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  6. Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  7. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar] [CrossRef]
  8. Mironov, I. Rényi Differential Privacy. In Proceedings of the 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Santa Barbara, CA, USA, 21–25 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 263–275. [Google Scholar]
  9. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  10. Lim, B.; Arik, S.Ö.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  11. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Zohren, S.; Roberts, S. DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. IEEE Trans. Signal Process. 2019, 67, 3001–3012. [Google Scholar] [CrossRef]
  13. Hamilton, J.D. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 1989, 57, 357–384. [Google Scholar] [CrossRef]
  14. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixtures of Local Experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
  15. Chen, N.; Li, B.; Wang, Y.; Ying, X.; Wang, L.; Zhang, C.; Guo, Y.; Li, M.; An, W. Motion and Appearance Decoupling Representation for Event Cameras. IEEE Trans. Image Process. 2025, 34, 5964–5977. [Google Scholar] [CrossRef] [PubMed]
  16. Feng, Z.-R.; Li, Y.-H.; Chen, W.-Z.; Su, X.-P.; Chen, J.-N.; Li, J.-P.; Liu, H.; Li, S.-B. Infrared and Visible Image Fusion Based on Improved Latent Low-Rank and Unsharp Masks. Spectrosc. Spectr. Anal. 2025, 45, 2034–2044. [Google Scholar]
  17. Tan, C.; Liu, H.; Chen, L.; Wang, J.; Chen, X.; Wang, G. Characteristic analysis and model predictive-improved active disturbance rejection control of direct-drive electro-hydrostatic actuators. Expert Syst. Appl. 2026, 301, 130565. [Google Scholar] [CrossRef]
  18. Genest, C.; Zidek, J.V. Combining Probability Distributions: A Critique and an Annotated Bibliography. Stat. Sci. 1986, 1, 113–135. [Google Scholar]
  19. Konečný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar] [CrossRef]
  20. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  21. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.U.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR, Virtual, 13–18 July 2020; Volume 119, pp. 5132–5143. [Google Scholar]
  22. Reddi, S.J.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive Federated Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  23. Xie, C.; Koyejo, O.; Gupta, I. Asynchronous Federated Optimization. arXiv 2019, arXiv:1903.03934. [Google Scholar]
  24. Blanchard, P.; El Mhamdi, E.M.; Guerraoui, R.; Stainer, J. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 119–129. [Google Scholar]
  25. Yin, D.; Chen, Y.; Ramchandran, K.; Bartlett, P.L. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 5636–5645. [Google Scholar]
  26. Cao, X.; Fang, M.; Liu, J.; Gong, N.Z. FLTrust: Byzantine-Robust Federated Learning via Trust Bootstrapping. In Proceedings of the Network and Distributed System Security Symposium (NDSS), Virtual, 21–25 February 2021; Internet Society: Reston, VA, USA, 2021. [Google Scholar]
  27. Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; Vojnovic, M. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 1709–1720. [Google Scholar]
  28. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  29. Stich, S.U.; Cordonnier, J.B.; Jaggi, M. Sparsified SGD with Memory. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  30. Elsken, T.; Metzen, J.H.; Hutter, F. Neural Architecture Search: A Survey. J. Mach. Learn. Res. 2019, 20, 1–21. [Google Scholar]
  31. Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  32. Maddison, C.J.; Mnih, A.; Teh, Y.W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  33. Zhu, H.; Zhang, H.; Jin, Y. From federated learning to federated neural architecture search: A survey. Complex Intell. Syst. 2021, 7, 1311–1330. [Google Scholar] [CrossRef]
  34. He, C.; Annavaram, M.; Avestimehr, S. FedNAS: Federated deep learning via neural architecture search. arXiv 2020, arXiv:2004.08546. [Google Scholar]
  35. Singh, I.; Zhou, H.; Yang, K.; Ding, M.; Lin, B.; Xie, P. Differentially-private federated neural architecture search. arXiv 2020, arXiv:2006.10559. [Google Scholar] [CrossRef]
  36. Yao, D.; Li, B. PerFedRLNAS: One-for-all personalized federated neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 16398–16406. [Google Scholar]
  37. Zhang, C.; Shan, G.; Roh, B.H. Fair federated learning for multi-task 6G NWDAF network anomaly detection. IEEE Trans. Intell. Transp. Syst. 2024, 26, 17359–17370. [Google Scholar] [CrossRef]
  38. Yang, Z.; Shao, Z.; Huangfu, H.; Yu, H.; Teoh, A.B.J.; Li, X.; Shan, H.; Zhang, Y. Enhancing federated learning through exploring filter-aware relationships and personalizing local structures. Pattern Recognit. 2025, 171, 112281. [Google Scholar] [CrossRef]
  39. Yang, Z.; Chen, Y.; Wang, Z.; Shan, H.; Chen, Y.; Zhang, Y. Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 5154–5163. [Google Scholar] [CrossRef]
Figure 1. Overview of FedRegNAS. Sensitive client-side financial time-series data (price features, returns, context) are processed locally, fed into a regime-aware differentiable architecture search under ADDP, aggregated via KL-barycentric federated NAS, and deployed as a privacy-preserving stock-return forecasting model.
Figure 2. Chronological data splits per benchmark with non-overlapping labels. Light gray: train; medium gray: validation (final 10% of training horizon); dark gray: test (calendar year 2025). Labels are positioned above (Daily-US, Minute-US) or below (Daily-Global) the bars to prevent overlap. No legend is drawn; the caption encodes color semantics.
Figure 3. Sensitivity of FedRegNAS to decoupled DP noise scales (σ_w, σ_α) on Daily-US. RMSE increases mainly as σ_α grows, while the effect of σ_w is comparatively mild for a fixed σ_α, indicating that architecture gradients are the dominant leakage–performance bottleneck.
Figure 4. Validation-loss convergence on Daily-US. Solid: FedRegNAS; dashed/dotted variants correspond to individual ablations (mapping described in the text). The full model converges faster and to a lower final loss, showing improved stability from regime gating, KL aggregation, and trust weighting.
Figure 5. Trust-weight entropy over rounds (lower is better). Solid: FedRegNAS; other curves correspond to ablations (mapping described in the text). The full model shows faster entropy decay, implying more decisive and stable aggregation.
Figure 6. Mean client uplink per round on Daily-US. Bar fills denote methods (light to dark): FedAvg-LSTM, FedProx-Trans, Centralized-DARTS, and FedRegNAS. No legend is drawn; mapping is provided here.
Figure 7. Privacy budget trajectories. Solid: ε_w (weights) under ADDP; dashed: ε_α (architecture) under ADDP with uploads every H_α = 5 rounds until Phase II at round 300; dotted/dash-dotted: coupled-DP baseline where both channels accrue the same budget more rapidly. No legend is drawn; curve semantics are specified here.
Table 1. Datasets and client partitions. Each dataset specifies the number of clients N, temporal resolution, train/test spans, input window L, feature dimension d, and median per-client statistics (tickers and train samples). These per-client counts are reported explicitly to facilitate cross-dataset comparison.

| Dataset | N | Resolution | Train Span | Test Span | L | d | Median Tickers/Client | Median Train Samples/Client |
|---|---|---|---|---|---|---|---|---|
| Daily-US | 20 | 1 day | 2012–2024 | 2025 | 64 | 32 | 25–30 | ∼80 k sequences |
| Minute-US | 25 | 1 min | 2019–2024 | 2025 | 120 | 40 | 8–10 | ∼2.5 M bars |
| Daily-Global | 18 | 1 day | 2012–2024 | 2025 | 64 | 32 | 20–25 | ∼75 k sequences |
Notes: Daily-US clients correspond to S&P 500-style rolling constituents; Minute-US clients are formed from the top 200 U.S. equities after liquidity filtering; Daily-Global clients consist of large-cap US/EU/APAC universes, with disjoint ticker/time partitions across clients.
Table 2. Optimization, privacy, and communication settings. Temperatures anneal τ: 1.5 → 0.2. Architecture uploads occur every H_α rounds. Quantization uses 8-bit stochastic rounding; sparsification keeps the top-k = 20% of entries post-clipping.

| Dataset | K | E | H_α | q | (C_w, C_α) | (σ_w, σ_α) | Compression |
|---|---|---|---|---|---|---|---|
| Daily-US | 300 | 5 | 5 | 0.3 | (1.0, 0.2) | (0.75, 1.20) | 8-bit + top-k |
| Minute-US | 600 | 5 | 5 | 0.3 | (1.0, 0.2) | (0.75, 1.20) | 8-bit + top-k |
| Daily-Global | 300 | 5 | 5 | 0.3 | (1.0, 0.2) | (0.75, 1.20) | 8-bit + top-k |

| Dataset | Regimes R | Resource Weights (λ_comm, λ_lat) | ε_w | ε_α |
|---|---|---|---|---|
| All | 3 | (1.0, 0.5) | ≈2.0 | ≈1.0 |
Table 3. Aggregate forecasting and trading metrics on test sets. RMSE (lower is better), DA in %, SR annualized. CIs are client-bootstrapped 95% intervals.

| Method | Daily-US RMSE | Daily-US DA | Daily-US SR | Minute-US RMSE | Minute-US DA | Minute-US SR | Daily-Global RMSE | Daily-Global DA | Daily-Global SR |
|---|---|---|---|---|---|---|---|---|---|
| Local-GRU | 0.0118 | 53.1 | 0.54 | 0.00128 | 52.7 | 0.96 | 0.0103 | 53.4 | 0.62 |
| (95% CI) | [0.0116, 0.0120] | [52.3, 53.8] | [0.49, 0.59] | [0.00126, 0.00130] | [52.1, 53.2] | [0.90, 1.01] | [0.0101, 0.0105] | [52.7, 54.1] | [0.58, 0.66] |
| FedAvg-LSTM | 0.0112 | 54.7 | 0.68 | 0.00124 | 53.6 | 1.10 | 0.0099 | 54.2 | 0.74 |
| (95% CI) | [0.0110, 0.0114] | [54.1, 55.3] | [0.63, 0.73] | [0.00122, 0.00126] | [53.0, 54.1] | [1.05, 1.15] | [0.0097, 0.0101] | [53.6, 54.8] | [0.70, 0.78] |
| FedProx-Trans | 0.0107 | 56.8 | 0.78 | 0.00116 | 55.3 | 1.28 | 0.0095 | 56.8 | 0.85 |
| (95% CI) | [0.0105, 0.0109] | [56.2, 57.4] | [0.73, 0.83] | [0.00114, 0.00118] | [54.7, 55.9] | [1.23, 1.32] | [0.0093, 0.0097] | [56.2, 57.3] | [0.81, 0.89] |
| Centralized-DARTS | 0.0106 | 57.1 | 0.81 | 0.00114 | 55.6 | 1.31 | 0.0094 | 57.2 | 0.88 |
| (95% CI) | [0.0104, 0.0108] | [56.5, 57.7] | [0.76, 0.85] | [0.00112, 0.00116] | [55.0, 56.1] | [1.26, 1.36] | [0.0092, 0.0096] | [56.6, 57.8] | [0.83, 0.92] |
| FedRegNAS (ours) | 0.0102 | 59.0 | 0.87 | 0.00110 | 56.9 | 1.41 | 0.0092 | 59.1 | 0.95 |
| (95% CI) | [0.0101, 0.0104] | [58.5, 59.5] | [0.83, 0.91] | [0.00108, 0.00112] | [56.4, 57.4] | [1.37, 1.46] | [0.0091, 0.0093] | [58.6, 59.6] | [0.91, 0.99] |
Table 4. Improvements of FedRegNAS over the strongest federated baseline (FedProx-Trans). Deltas are absolute differences; the rightmost columns report the fraction of clients with p < 0.05 under paired tests (RMSE: Diebold–Mariano on squared error; DA/SR: stratified permutation).

| Dataset | ΔRMSE | ΔDA (pp) | ΔSR | p < 0.05 (RMSE) | p < 0.05 (DA) | p < 0.05 (SR) |
|---|---|---|---|---|---|---|
| Daily-US | 0.0005 | +2.2 | +0.09 | 0.72 | 0.68 | 0.63 |
| Minute-US | 0.00006 | +1.6 | +0.13 | 0.78 | 0.74 | 0.70 |
| Daily-Global | 0.0003 | +2.3 | +0.10 | 0.69 | 0.66 | 0.61 |
Table 5. Forecasting accuracy (RMSE) and relative improvement. Lower RMSE indicates higher accuracy. Improvements are relative to the strongest federated baseline (FedProx-Trans).

| Method | Daily-US RMSE | Minute-US RMSE | Daily-Global RMSE | Daily-US Impr. (%) | Minute-US Impr. (%) | Daily-Global Impr. (%) |
|---|---|---|---|---|---|---|
| Local-GRU | 0.0118 | 0.00128 | 0.0103 | −10.3 | −10.3 | −8.4 |
| FedAvg-LSTM | 0.0112 | 0.00124 | 0.0099 | −4.7 | −6.9 | −4.2 |
| FedProx-Trans | 0.0107 | 0.00116 | 0.0095 | 0.0 | 0.0 | 0.0 |
| Centralized-DARTS | 0.0106 | 0.00114 | 0.0094 | +0.9 | +1.7 | +1.1 |
| FedRegNAS (ours) | 0.0102 | 0.00110 | 0.0092 | +4.7 | +5.2 | +3.2 |
Table 6. Effect of regime cardinality R on forecasting performance for FedRegNAS on Daily-US and Minute-US. Results are averaged over three runs; best values per column are in bold. R ∈ {4, 6} provides a robust trade-off across datasets.

| R | Daily-US RMSE | Daily-US DA (%) | Daily-US Sharpe | Minute-US RMSE | Minute-US DA (%) | Minute-US Sharpe |
|---|---|---|---|---|---|---|
| 2 | 0.134 | 56.1 | 0.62 | 0.149 | 54.3 | 0.55 |
| 3 | 0.132 | 56.9 | 0.67 | 0.146 | 55.2 | 0.59 |
| 4 | 0.129 | 57.8 | 0.74 | 0.142 | 56.0 | 0.66 |
| 6 | **0.128** | **58.0** | **0.76** | **0.141** | **56.3** | **0.68** |
| 8 | 0.131 | 57.2 | 0.71 | 0.144 | 55.8 | 0.64 |
| 10 | 0.133 | 56.8 | 0.69 | 0.146 | 55.5 | 0.61 |
Table 7. Comparison of KL-barycentric aggregation and simple averaging of architecture logits under non-IID and severely non-IID client partitions on three datasets. KL-based aggregation improves accuracy and reduces the dispersion of client architectures, consistent with more stable behavior under heterogeneity.

| Method | Daily-US RMSE | Daily-US DA (%) | Minute-US RMSE | Minute-US DA (%) | Daily-Global RMSE | Daily-Global DA (%) | Avg. JS-Div. (All Datasets) |
|---|---|---|---|---|---|---|---|
| Simple avg. (non-IID) | 0.132 | 56.3 | 0.145 | 55.1 | 0.138 | 55.8 | 0.21 |
| KL-barycentric (FedRegNAS) | 0.129 | 57.8 | 0.141 | 56.3 | 0.135 | 57.0 | 0.13 |
| Simple avg. (severe non-IID) | 0.134 | 55.7 | 0.147 | 54.5 | 0.140 | 55.1 | 0.24 |
| KL-barycentric (severe non-IID) | 0.130 | 57.1 | 0.142 | 55.8 | 0.136 | 56.2 | 0.15 |
Table 8. Calibration of differentiable latency and communication proxies against real measurements under different device and network profiles. High correlation and low MAPE support their use as faithful surrogates in the objective.

| Proxy/Profile | Pearson Correlation | MAPE (%) |
|---|---|---|
| Latency proxy vs. measured (GPU + LAN) | 0.95 | 6.1 |
| Latency proxy vs. measured (CPU + LAN) | 0.93 | 7.8 |
| Latency proxy vs. measured (CPU + 4G) | 0.91 | 9.3 |
| Comm. proxy vs. bytes (GPU + LAN) | 0.97 | 4.8 |
| Comm. proxy vs. bytes (CPU + LAN) | 0.96 | 5.4 |
| Comm. proxy vs. bytes (CPU + 4G) | 0.94 | 7.1 |
Table 9. Volatility-stratified trading performance on Daily-US. FedRegNAS yields higher directional accuracy and Sharpe ratio in both high- and low-volatility regimes, with especially strong improvements during turbulent periods relative to Local-GRU and FedAvg-LSTM.

| Method | High-Volatility DA (%) | High-Volatility Sharpe | Low-Volatility DA (%) | Low-Volatility Sharpe |
|---|---|---|---|---|
| Local-GRU | 53.4 | 0.52 | 55.9 | 0.80 |
| FedAvg-LSTM | 54.2 | 0.61 | 56.7 | 0.88 |
| FedRegNAS | 56.9 | 0.87 | 58.3 | 1.04 |
Table 10. Ablation study on Daily-US. Each variant removes or modifies one key mechanism while preserving all other hyperparameters. RMSE (lower is better), DA (%), Sharpe ratio (SR), mean client upload per round (KB), and variance of validation loss.

| Variant | RMSE | DA (%) | SR | Comm/Round (KB) | Var (loss) ×10⁻⁴ |
|---|---|---|---|---|---|
| FedRegNAS (full) | 0.0102 | 59.0 | 0.87 | 1605 | 1.4 |
| w/o regime gating | 0.0108 | 56.7 | 0.79 | 1605 | 2.8 |
| mean aggregation (no KL) | 0.0106 | 57.4 | 0.82 | 1605 | 2.3 |
| coupled-DP (single noise budget) | 0.0105 | 57.6 | 0.81 | 1605 | 2.5 |
| no trust weighting | 0.0107 | 57.0 | 0.80 | 1605 | 2.9 |

| Variant | ΔRMSE vs. Full | ΔDA (pp) | ΔSR | Trust-Weight Entropy |
|---|---|---|---|---|
| w/o regime gating | +0.0006 | −2.3 | −0.08 | 0.91 |
| mean aggregation | +0.0004 | −1.6 | −0.05 | 0.72 |
| coupled-DP | +0.0003 | −1.4 | −0.06 | 0.80 |
| no trust weighting | +0.0005 | −2.0 | −0.07 | 1.00 |
Table 11. Communication and latency across datasets. Comm/round is mean client uplink after clipping, 8-bit stochastic quantization, and top-k sparsification (k = 20%). Latency is the median on-device forward pass on a mobile-class CPU.

| Method | Daily-US Comm/Round (KB) | Daily-US Latency (ms) | Minute-US Comm/Round (KB) | Minute-US Latency (ms) | Daily-Global Comm/Round (KB) | Daily-Global Latency (ms) |
|---|---|---|---|---|---|---|
| Local-GRU | 0 | 3.9 | 0 | 5.8 | 0 | 3.8 |
| FedAvg-LSTM | 2820 | 4.1 | 3160 | 6.2 | 2750 | 4.0 |
| FedProx-Trans | 5410 | 6.8 | 6220 | 8.1 | 5090 | 6.5 |
| Centralized-DARTS | 0 | 5.7 | 0 | 7.1 | 0 | 5.6 |
| FedRegNAS | 1605 | 3.6 | 1810 | 5.2 | 1530 | 3.5 |

| Method | Daily-US Uplink Stdev | Daily-US Downlink (KB) | Daily-US CPU Util. (%) | Minute-US Uplink Stdev | Minute-US Downlink (KB) | Minute-US CPU Util. (%) | Daily-Global Uplink Stdev | Daily-Global Downlink (KB) | Daily-Global CPU Util. (%) |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg-LSTM | 210 | 1180 | 64 | 240 | 1260 | 72 | 205 | 1150 | 63 |
| FedProx-Trans | 380 | 2040 | 78 | 410 | 2190 | 83 | 360 | 1980 | 77 |
| FedRegNAS | 145 | 910 | 57 | 160 | 980 | 66 | 140 | 885 | 56 |
Table 12. Uplink payload decomposition (Daily-US). Mean KB per round by component for FedRegNAS. Architectural messages are sent every H_α = 5 rounds; weight messages are sent every round. Percentages are relative to total mean uplink.

| Component | Δθ̂ (Post-Clip, Post-Noise) | Δα̂ (Active Rounds Only) | Metadata (IDs, Seeds) | Total |
|---|---|---|---|---|
| Size (KB) | 1220 | 310 | 75 | 1605 |
| Share (%) | 76.0 | 19.3 | 4.7 | 100 |

| Setting | H_α = 2 | H_α = 5 | H_α = 8 | H_α = 16 |
|---|---|---|---|---|
| Mean uplink (KB) | 1880 | 1605 | 1480 | 1430 |
| ε_α (at δ = 10⁻⁵) | 1.48 | 1.00 | 0.78 | 0.60 |
| RMSE (test) | 0.0102 | 0.0102 | 0.0103 | 0.0104 |
Table 13. Centralized sequence-model baselines versus FedRegNAS. Centralized models pool all data and do not respect FL/DP constraints, serving as upper bounds or side benchmarks. FedRegNAS-C denotes the centralized version of the architecture discovered by FedRegNAS, while “FedRegNAS (federated, DP)” is the full privacy-preserving federated method.

| Model | Daily-US RMSE | Daily-US DA (%) | Daily-US Sharpe | Minute-US RMSE | Minute-US DA (%) | Minute-US Sharpe |
|---|---|---|---|---|---|---|
| DeepAR (centralized) | 0.128 | 57.1 | 0.78 | 0.140 | 55.7 | 0.64 |
| N-BEATS (centralized) | 0.127 | 57.6 | 0.80 | 0.139 | 56.0 | 0.67 |
| TFT (centralized) | 0.126 | 58.2 | 0.83 | 0.138 | 56.5 | 0.70 |
| FedRegNAS-C (centralized) | 0.127 | 58.0 | 0.82 | 0.139 | 56.3 | 0.69 |
| FedRegNAS (federated, DP) | 0.129 | 57.8 | 0.76 | 0.141 | 56.3 | 0.68 |
Table 14. Ablation of regime gating and KL-barycentric aggregation on the Minute-US dataset. Both components contribute to improved performance, and their combination yields the best results, demonstrating stability of the design in a high-frequency, nonstationary scenario.

| Variant (Minute-US) | RMSE | DA (%) | Sharpe |
|---|---|---|---|
| No regime gating, no KL (FedAvg logits) | 0.146 | 55.0 | 0.60 |
| Regime gating only | 0.143 | 55.7 | 0.64 |
| KL aggregation only | 0.143 | 55.9 | 0.65 |
| Full FedRegNAS (gating + KL) | 0.141 | 56.3 | 0.68 |
Table 15. Training-time comparison and amortization for FedRegNAS and fixed-architecture FL baselines on Daily-US. The one-time NAS search is more expensive, but when amortized across multiple markets or retraining periods, the effective per-deployment cost approaches that of FedProx-Trans.

| Method | Time (h) | GPU-h | Amortized (3 Dep., GPU-h) |
|---|---|---|---|
| FedAvg-LSTM | 8.1 | 32 | 32 |
| FedProx-Trans | 10.4 | 41 | 41 |
| FedRegNAS (search + train) | 24.6 | 98 | ≈33 |
| FedRegNAS (fine-tune only) | 9.0 | 36 | ≈33 |
Table 16. Performance of different methods in the simulated institutional scenario on Minute-US. All metrics are computed after enforcing regulatory windows and trading constraints. FedRegNAS attains the highest directional accuracy and Sharpe ratio while keeping turnover and pre-clipping constraint violations at or below the levels of the FL baselines.

| Method | DA (%) | Sharpe | Avg. Turnover (%/day) | Constraint Violations (%) |
|---|---|---|---|---|
| Local-GRU | 54.1 | 0.58 | 12.4 | 3.9 |
| FedAvg-LSTM | 55.0 | 0.62 | 11.8 | 3.2 |
| FedProx-Trans | 55.6 | 0.66 | 11.5 | 2.8 |
| FedRegNAS | 56.4 | 0.73 | 11.2 | 2.1 |
Table 17. End-to-end inference latency per prediction under the simulated institutional setup, including communication and decoding, measured across clients on a heterogeneous GPU/CPU cluster. All methods satisfy the 10 ms mean-latency budget; FedRegNAS remains competitive with FedAvg-LSTM and improves on FedProx-Trans despite performing federated NAS.

| Method | Mean Latency (ms) | 95th Percentile Latency (ms) | Budget Satisfied (≤10 ms Mean) |
|---|---|---|---|
| Local-GRU | 7.6 | 9.8 | Yes |
| FedAvg-LSTM | 8.3 | 10.5 | Yes |
| FedProx-Trans | 9.4 | 11.9 | Yes |
| FedRegNAS | 8.7 | 10.8 | Yes |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
