A Hardware-Aware Federated Meta-Learning Framework for Intraday Return Prediction Under Data Scarcity and Edge Constraints
Abstract
1. Introduction
- Data Scarcity: Local adaptation must succeed with very few samples as a new regime emerges. Few-Shot Learning (FSL) provides useful principles [5], yet most FSL pipelines assume centralized compute and do not directly target edge training budgets.
- Resource Constraints: Trading terminals operate under strict limits on power, memory, and throughput [7,8,9]. Standard on-device fine-tuning can be prohibitively expensive [10,11,12], and the resulting overhead conflicts with the tight latency requirements typical of Intraday Return Prediction workloads.
- Privacy and Communication Efficiency: While Federated Learning (FL) can keep raw factor data local, naïvely synchronizing dense model updates is expensive. In latency-sensitive environments, the communication cost of frequent, high-volume exchanges becomes a practical bottleneck.
| Algorithm 1 Sleep Node Algorithm (SNA) for Local Few-shot Adaptation |
- Sensitivity-aware sparse adaptation via the Sleep Node Algorithm (SNA). Unlike conventional magnitude-based pruning or static sparsity strategies, SNA exploits a meta-learned initialization to identify parameters that consistently exhibit low sensitivity during few-shot adaptation. These parameters, referred to as lazy nodes, are frozen rather than removed, which stabilizes local adaptation under severe data scarcity while yielding highly sparse gradient updates. This sensitivity-aware freezing mechanism is fundamentally different from existing sparse training approaches.
- Joint exploitation of multiple sources of training sparsity. In contrast to prior methods that typically exploit only one or a limited subset of sparsity sources, this work simultaneously leverages sparsity in weights, weight updates, back-propagated errors, and activations. This unified formulation enables substantial reductions in computation, memory access, and communication overhead during training, without degrading predictive performance.
- A hardware–software co-designed training accelerator for sparse adaptation. Different from accelerators primarily optimized for inference or dense back-propagation, the proposed EPAST architecture is explicitly designed to support sparse and irregular training workloads. Through a backward pipeline (BPIP) dataflow and a hybrid workload allocation strategy, EPAST effectively translates algorithm-level sparsity into practical latency and energy efficiency gains under realistic edge hardware constraints.
- A federated meta-learning framework aligned with intraday trading practice. Unlike standard federated or meta-learning approaches evaluated under static or IID assumptions, the proposed framework targets near-live intraday trading scenarios. It combines few-shot on-device adaptation, non-IID federated learning, and hardware-aware execution into an end-to-end system, bridging financial modeling requirements with the constraints of edge deployment.
2. Related Works and Preliminaries
2.1. Deep Learning in Quantitative Finance and the Need for Adaptation
2.2. Hardware Acceleration Challenges and Training Dynamics
The Backward Locking Bottleneck in CNN Training
- Feed-Forward (FF): compute activations
- Backward Propagation (BP): propagate errors using rotated weights
- Weight Update (WU, also referred to as the Weight Gradient (WG) stage in later sections): compute gradients and update weights. The three stages and the dependency that causes backward locking are summarized below.
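To make the dependency explicit, the three stages for a fully connected layer l can be written in standard back-propagation notation. This is a generic formulation with placeholder symbols, not the paper's own equation numbering; for convolutional layers the products become convolutions.

```latex
% Generic back-propagation notation: a_l activations, e_l back-propagated
% errors, W_l weights of layer l, \eta the learning rate.
\begin{align*}
\text{FF:} \quad & z_l = W_l\, a_{l-1}, \qquad a_l = \sigma(z_l) \\
\text{BP:} \quad & e_{l-1} = \big(W_l^{\top} e_l\big) \odot \sigma'(z_{l-1}) \\
\text{WU:} \quad & \Delta W_l = -\,\eta\, e_l\, a_{l-1}^{\top}
\end{align*}
% Backward locking: both e_{l-1} (BP) and \Delta W_l (WU) depend on the
% downstream error e_l, so layer l's backward work cannot start until
% layer l+1 has finished propagating its error.
```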
3. The Hardware-Aware Federated Learning Framework
3.1. Bridging the Gap: The Need for Hardware-Aware Adaptation
- Standard Pre-training: We first train a dense model on the cloud to learn universal market features.
- Meta-Pre-Training (MPT): We then prune the model to a sparse version and explicitly train it for generalization using meta-learning objectives. This ensures the model learns how to adapt quickly.
- SNA-based Fine-tuning: Finally, the model is quantized and deployed to the edge, where it is fine-tuned using our proposed Sleep Node Algorithm (SNA).
| Algorithm 2 The Hardware-aware Training Framework |
Require: Cloud dataset D_c, edge dataset D_e, feature extractor f, classifier c, epochs T, loss L
1: function FSL(D, f)
2:   Sample M few-shot tasks from D
3:   for i = 1 to M do
4:     get loss l of f using Equations (4) and (5)
5:     update f with l
6:   end for
7: end function
8: // 1st stage: pre-training
9: for epoch = 1 to T do
10:   for each batch in D_c do
11:     Update c and f with L
12:   end for
13: end for
14: // 2nd stage: meta-pre-training
15: Prune f to get the sparse model f_s
16: for epoch = 1 to T do
17:   FSL(D_c, f_s)
18: end for
19: // 3rd stage: fine-tuning
20: Quantize f_s on D_e to get f_q
21: for epoch = 1 to T do
22:   FSL(D_e, f_q)
23: end for
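The following PyTorch-style sketch illustrates the three-stage flow of Algorithm 2 under simplifying assumptions: the names `prune_by_magnitude` and `fsl_adapt`, the toy MLP, and the synthetic data are illustrative placeholders rather than the authors' implementation, and the meta-objective of Equations (4) and (5) is abstracted as one adaptation step per sampled task.

```python
# Sketch of the three-stage training flow (Algorithm 2); all names and data
# are illustrative placeholders, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def prune_by_magnitude(model: nn.Module, sparsity: float) -> dict:
    """Return per-layer binary masks that zero out the smallest-|w| entries."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                       # skip biases
            continue
        k = max(int(sparsity * p.numel()), 1)
        thresh = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > thresh).float()
        p.data.mul_(masks[name])              # apply structural pruning
    return masks

def fsl_adapt(model, tasks, lr=1e-3, masks=None):
    """Few-shot adaptation: one gradient step per sampled task."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for support_x, support_y in tasks:
        loss = F.mse_loss(model(support_x), support_y)   # generic task loss
        opt.zero_grad(); loss.backward()
        if masks is not None:                 # keep pruned weights at zero
            for name, p in model.named_parameters():
                if name in masks and p.grad is not None:
                    p.grad.mul_(masks[name])
        opt.step()

# Toy data standing in for cloud / edge factor datasets.
feat_dim = 32
cloud_x, cloud_y = torch.randn(512, feat_dim), torch.randn(512, 1)
edge_tasks = [(torch.randn(5, feat_dim), torch.randn(5, 1)) for _ in range(10)]
model = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# 1st stage: dense pre-training on the cloud dataset.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    loss = F.mse_loss(model(cloud_x), cloud_y)
    opt.zero_grad(); loss.backward(); opt.step()

# 2nd stage: prune, then meta-pre-train with few-shot tasks from cloud data.
masks = prune_by_magnitude(model, sparsity=0.5)
cloud_tasks = [(cloud_x[i:i + 5], cloud_y[i:i + 5]) for i in range(0, 50, 5)]
fsl_adapt(model, cloud_tasks, masks=masks)

# 3rd stage: (quantization omitted here) fine-tune on scarce edge data.
fsl_adapt(model, edge_tasks, masks=masks)
```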
3.2. The Meta-Pre-Training Strategy
Task Alignment: Numerical Interface for Stable Evaluation
3.3. Federated Meta-Learning with SNA Adaptation
- Local SNA Loop: The client k adapts the model to its private task (comprising a Support Set $\mathcal{S}_k$ and a Query Set $\mathcal{Q}_k$). Critically, instead of a dense update, we apply a sparse mask $m$ derived from SNA, so each local step takes the form $\theta_k \leftarrow \theta_k - \eta\,\big(m \odot \nabla_{\theta}\mathcal{L}(\theta_k;\mathcal{S}_k)\big)$.
- Sparse Aggregation: When uploading updates to the server, only the non-zero values masked by SNA are transmitted, i.e., $\Delta\theta_k^{\mathrm{up}} = m \odot (\theta_k' - \theta^{\mathrm{global}})$, where $\theta_k'$ denotes the locally adapted weights; because the mask is fixed and known to the server, no per-round index list needs to be sent. A minimal sketch of one such round is given below.
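The sketch below illustrates one federated round with SNA-masked local adaptation and sparse upload. The toy linear model, the flattened-tensor mask, the `client_update`/`server_aggregate` helpers, and the dictionary-free payload format are illustrative assumptions, not the authors' protocol.

```python
# One federated round: masked local adaptation + index-free sparse upload.
# FedAvg-style averaging is assumed on the server side.
import torch

def client_update(global_w, sna_mask, support, lr=1e-2, steps=5):
    """Local few-shot adaptation: only unmasked (active) entries are updated."""
    w = global_w.clone().requires_grad_(True)
    x, y = support
    for _ in range(steps):
        loss = ((x @ w - y) ** 2).mean()          # toy linear model, MSE loss
        (grad,) = torch.autograd.grad(loss, w)
        w = (w - lr * sna_mask * grad).detach().requires_grad_(True)
    delta = (w - global_w).detach()
    # The mask is fixed and shared, so only the active values are transmitted.
    return delta[sna_mask.bool()]

def server_aggregate(global_w, payloads, sna_mask):
    """Scatter the sparse client deltas back and average them."""
    agg = torch.zeros_like(global_w)
    for values in payloads:
        delta = torch.zeros_like(global_w)
        delta[sna_mask.bool()] = values
        agg += delta
    return global_w + agg / len(payloads)

d = 64
global_w = torch.randn(d)
sna_mask = (torch.rand(d) > 0.8).float()          # ~20% active nodes
clients = [(torch.randn(5, d), torch.randn(5)) for _ in range(4)]  # 5-shot supports

payloads = [client_update(global_w, sna_mask, s) for s in clients]
global_w = server_aggregate(global_w, payloads, sna_mask)
print("values transmitted per client:", int(sna_mask.sum()), "of", d)
```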
3.4. The Sleep Node Algorithm (SNA): Structural Regularization via Lazy Nodes
- Active Nodes: The trainable parameters that are updated during few-shot adaptation.
- Sparse Nodes (marked ‘0’): Unimportant connections pruned permanently to save memory.
- Lazy Nodes (marked ‘×’): This is our novel contribution. These are weights that are structurally necessary for the forward pass but statistically stable enough to be frozen during the backward pass.
Refined Selection Strategy: Layer-Wise Adaptive Masking (Algorithm 3)
- Scale Invariance: It automatically adapts to the varying dynamic ranges of different layers, avoiding the risk of indiscriminately pruning parameters in layers that naturally possess smaller magnitudes due to normalization or depth.
- Meta-Prior Reliance: Crucially, this selection is performed after the Meta-Pre-Training stage. Since the meta-initialization implies that the model has already converged to a generalized optimum, parameters with negligible magnitudes at this stage represent connections that the meta-learner has deemed structurally redundant for the target distribution. Freezing them acts as an explicit prior to prevent overfitting during few-shot adaptation.
| Algorithm 3 Layer-wise Adaptive Sleep Node Training |
Require: Pre-trained Model Weights W, Sparsity Ratio p, Local Dataset D, Learning Rate η
Ensure: Updated Model W′
1: // Phase 1: Meta-Prior Mask Generation (Server Side)
2: Initialize Mask set M ← ∅
3: for each layer l do
4:   Compute layer-specific threshold:
5:   τ_l ← Quantile(|W_l|, p)  ▹ Layer-adaptive threshold
6:   Generate binary mask for layer l:
7:   M_l ← 1[|W_l| ≥ τ_l]
8:   M ← M ∪ {M_l}
9: end for
10: Dispatch W and M to Edge Terminal
11: // Phase 2: Lazy-Node Aware Fine-tuning (Edge Side)
12: for each mini-batch (x, y) ∈ D do
13:   Forward: ŷ ← f(x; W)
14:   Backward: compute gradients G_l of the loss L(ŷ, y) w.r.t. W_l
15:   for each layer l do
16:     Apply Sleep Mask to gradients:
17:     G_l ← G_l ⊙ M_l  ▹ Freeze Lazy Nodes
18:     Update Active Nodes only:
19:     W_l ← W_l − η · G_l
20:   end for
21: end for
22: return W
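A minimal PyTorch sketch of Algorithm 3 follows. The quantile-based per-layer threshold, the toy MLP, and the explicit gradient-masking loop are illustrative assumptions about how the two phases could be realized, not the authors' exact implementation.

```python
# Sketch of Algorithm 3: layer-wise adaptive mask generation (server side)
# followed by lazy-node-aware fine-tuning (edge side).
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_layer_masks(model: nn.Module, sparsity: float) -> dict:
    """Phase 1: per-layer threshold = p-quantile of |W_l|; 1 = active, 0 = lazy."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue
        tau = torch.quantile(p.abs().flatten(), sparsity)  # layer-adaptive threshold
        masks[name] = (p.abs() >= tau).float()
    return masks

def lazy_node_finetune(model, masks, loader, lr=1e-3, epochs=1):
    """Phase 2: gradients of lazy nodes are zeroed, so only active nodes move."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward()
            for name, p in model.named_parameters():
                if name in masks and p.grad is not None:
                    p.grad.mul_(masks[name])      # freeze lazy nodes
            opt.step()
    return model

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
masks = build_layer_masks(model, sparsity=0.6)     # 60% of each layer frozen
loader = [(torch.randn(16, 32), torch.randn(16, 1)) for _ in range(8)]
lazy_node_finetune(model, masks, loader)
```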
- Efficiency: We skip the most expensive part of training (WG stage) for a large portion of the network.
- Regularization: By freezing these parameters, we effectively reduce the hypothesis space, preventing the model from overfitting to the limited local data (as demonstrated later in Section 4.2.5).
3.5. Exploiting Intrinsic Error Sparsity
4. Algorithmic Experimental Results and Analysis
4.1. Experimental Setup
- Basic Price–Volume Factors: Direct statistics derived from 30-s OHLCV bars (e.g., VWAP, price range, and volume surges) to capture immediate intraday dynamics.
- ML-Mined Factors: Latent features automatically extracted via localized machine learning algorithms to uncover non-linear intraday patterns.
Cross-Sectional Normalization
- Outlier Handling: Features are clipped using a Winsorization method to mitigate the impact of extreme market volatility.
- Z-score Scaling: Each feature is cross-sectionally standardized to have a mean of 0 and a standard deviation of 1.
- Missing Value Imputation: Any remaining NaNs after normalization are filled with 0 to maintain numerical stability during training.
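The three preprocessing steps above can be sketched compactly in pandas. The 1st/99th-percentile winsorization bounds, the column names, and the per-bar cross-section layout (rows = stocks, columns = factors) are assumptions for illustration, not the exact settings used in the paper.

```python
# Cross-sectional preprocessing sketch: winsorize, z-score per timestamp,
# then fill residual NaNs with 0. Clip bounds are an assumed illustration.
import numpy as np
import pandas as pd

def normalize_cross_section(df: pd.DataFrame) -> pd.DataFrame:
    """df: rows = stocks in one 30-s bar, columns = factor values."""
    # Outlier handling: clip each factor to its cross-sectional 1%-99% range.
    lo, hi = df.quantile(0.01), df.quantile(0.99)
    df = df.clip(lower=lo, upper=hi, axis=1)
    # Z-score scaling: zero mean, unit std across the cross-section.
    df = (df - df.mean()) / df.std(ddof=0).replace(0.0, np.nan)
    # Missing value imputation: remaining NaNs -> 0 for numerical stability.
    return df.fillna(0.0)

# Toy cross-section: 100 stocks x 5 factors for a single 30-s bar.
bar = pd.DataFrame(np.random.randn(100, 5),
                   columns=["vwap_dev", "range", "vol_surge", "f4", "f5"])
bar.iloc[0, 0] = np.nan                     # simulate a missing factor value
print(normalize_cross_section(bar).describe().loc[["mean", "std"]])
```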
- Global Pre-training (2023 Full Year): Learn general intraday representations from historical data (Global Prior).
- Local Adaptation (2024 Q1): An ultra-few-shot stress test using a rolling window of only 5 trading days for the edge agent.
- Testing (2024 Q2–Q4): Held-out future data for out-of-sample evaluation.
4.2. Algorithmic Validation: Resolving the “Impossible Trinity”
4.2.1. Mechanism Validation: Why Weight Magnitude Proxies Sensitivity
4.2.2. Sparsity as a Regularizer: The “Less Is More” Phenomenon
- Overfitting Zone (Ratio < 0.3): The dense baseline (Ratio 0.0) yields a sub-optimal IC (≈0.17). With full degrees of freedom to update all parameters on scarce 5-day samples, the model over-adapts to stochastic market noise rather than generalized features.
- Sweet Spot (Ratio 0.4–0.8): As we increase the frozen ratio, the test IC actually climbs, forming a stable performance plateau that peaks at ≈0.22. This confirms that SNA acts as a structural regularizer, effectively restricting the hypothesis space to prevent catastrophic overfitting. This gap between the peak and the dense baseline demonstrates that the algorithmic gains of SNA are orthogonal to the numerical labeling strategy.
- Collapse Zone (Ratio > 0.9): Performance only degrades when sparsity becomes aggressive enough to prune the “Active Nodes”—the critical parametric logic required for signal recovery.
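For reference, the IC values reported in this sweep can be computed as the per-bar cross-sectional correlation between predictions and realized returns, averaged over bars. The helper below is a generic sketch (with an optional rank/Spearman variant), not the paper's exact scoring script.

```python
# Generic information-coefficient (IC) helper: per-bar cross-sectional
# correlation between predictions and realized returns, averaged over bars.
import numpy as np
import pandas as pd

def mean_ic(pred: pd.DataFrame, ret: pd.DataFrame, rank: bool = False) -> float:
    """pred / ret: rows = timestamps (bars), columns = stocks."""
    ics = []
    for t in pred.index:
        p, r = pred.loc[t], ret.loc[t]
        if rank:                              # Spearman: correlate the ranks
            p, r = p.rank(), r.rank()
        ics.append(np.corrcoef(p, r)[0, 1])
    return float(np.nanmean(ics))

# Toy example: 50 bars x 200 stocks, predictions weakly correlated with returns.
rng = np.random.default_rng(0)
ret = pd.DataFrame(rng.standard_normal((50, 200)))
pred = 0.2 * ret + pd.DataFrame(rng.standard_normal((50, 200)))
print(f"mean IC: {mean_ic(pred, ret):.3f}")
```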
4.2.3. Stability Under Concept Drift: Safety Through Inertia
4.2.4. Addressing Hardware Constraints: Sparsity Sensitivity
4.2.5. Robustness Analysis of Lazy Node Selection
4.3. Comparative Evaluation Against State-of-the-Art
4.3.1. Baselines and Experimental Rigor
- GARCH-XGBoost [43]: The industry gold standard for tabular financial data. It combines econometric volatility modeling (GARCH) with gradient boosting (XGBoost). It serves as the primary baseline for static supervised learning.
- DQN [44]: Represents value-based Reinforcement Learning, often touted for its ability to learn policies directly from market interaction.
- ELM [46]: Represents lightweight randomized learning, included to benchmark training speed and stability.
4.3.2. Performance Analysis: Why Others Fail at the Edge
5. The EPAST Training Accelerator
| Algorithm 4 EPAST Accelerator Execution Flow for Sparse On-device Training |
5.1. The Whole Training Architecture
5.2. The Hybrid Workload Allocation Scheme
5.3. The Proposed Backward Pipeline Dataflow for the Heterogeneous Architecture
6. Evaluations and Discussions
6.1. Hardware Implementation Results and Comparisons
6.2. Ablation Study: Bridging Computational Reduction and Latency Speed-Up
- From Theoretical Sparsity to Effective Utilization: While pure weight sparsity significantly drops the theoretical computational cost (as seen in the left subfigure of Figure 15), it initially fails to provide a proportional reduction in latency (the right subfigure shows little improvement over the dense baseline). This bottleneck is mainly caused by irregular memory access and workload imbalance under unstructured sparsity. By introducing the Line-up FIFO scheme, we restore PE utilization, finally transforming these theoretical gains into a measurable latency speed-up.
- Targeting the WG Stage Bottleneck: The Weight Gradient (WG) stage remains the dominant bottleneck after balancing. Incorporating SNA structurally prunes the WG computations and further raises the latency speed-up; exploiting dynamic error sparsity on top of this prunes redundant gradient/error paths and lifts the speed-up again.
- Maximizing Throughput via Pipeline Parallelism: The final leap in performance comes from the Backward Pipeline (BPIP) dataflow. By decoupling and overlapping the BP and WG stages (represented by the orange “BP&WG” blocks in Figure 15), we eliminate serialization delays and reach the highest end-to-end latency speed-up. Notably, this improvement is achieved with almost unchanged computational cost (the left subfigure saturates), highlighting that pipeline scheduling primarily converts algorithmic sparsity into system-level throughput. A toy analytical model of this conversion is sketched after this list.
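The interplay between computational reduction and latency can be illustrated with a toy analytical model of the backward path. All stage shares, sparsity levels, and utilization values below are made-up illustrative numbers, not EPAST measurements; the model only conveys why pipeline overlap, rather than sparsity alone, closes the gap between theoretical and realized speed-up.

```python
# Toy analytical model: how sparsity and BP/WG pipeline overlap translate
# into end-to-end latency speed-up vs. a dense, non-pipelined baseline (=1.0).
def latency_speedup(ff_share=0.2, bp_share=0.3, wg_share=0.5,
                    wg_sparsity=0.8, err_sparsity=0.5,
                    pe_utilization=0.9, overlap_bp_wg=True):
    # Effective WG work after skipping frozen weight-gradient computations.
    wg = wg_share * (1.0 - wg_sparsity) / pe_utilization
    # BP work reduced by dynamic error sparsity.
    bp = bp_share * (1.0 - err_sparsity) / pe_utilization
    if overlap_bp_wg:
        # BPIP-style pipelining: BP and WG of adjacent layers overlap, so the
        # backward path is bounded by the longer of the two streams.
        backward = max(bp, wg)
    else:
        backward = bp + wg
    return 1.0 / (ff_share + backward)

for overlap in (False, True):
    print(f"overlap={overlap}: ~{latency_speedup(overlap_bp_wg=overlap):.1f}x")
```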
6.3. Qualitative Comparisons with Related Works
- Sparsity: We compare against the most closely related training accelerators that also support sparsity during the fine-tuning stage. Ref. [22] supports dual zero skipping for inputs and outputs during FP, but suffers from increased control-logic overhead and degraded hardware utilization during the BP and WU stages. Instead, we exploit one sparsity type for each training stage, which keeps the control simple and the hardware utilization high across all stages. Similar to the proposed method, Procrustes [23] also adopts one-sided sparsity for each stage. Compared with Procrustes, we develop the sparsity in a more fine-grained way and use the characteristics of the fine-tuning process to exploit computational redundancy more deeply. More specifically, the proposed SNA uses the well pre-trained model to determine significant connections and skips unnecessary computations for both weights and weight updates. In addition, we exploit a new source of sparsity, error sparsity, during fine-tuning. In conclusion, all four sources of training sparsity (weights, weight updates, activations via clock gating, and errors) are leveraged in the proposed EPAST, exceeding the at most three sparsity types exploited in previous works [13,20,21,22,23]. As verified in Section 6, these optimizations contribute substantially to the final training speed-up.
- Pipeline processing: Ref. [15] proposes a pipeline structure enabling parallel computation of all three training stages, but the proposed DF-LNPU can only update the last few fully connected layers, because the accuracy of PDFA, the training computation-skipping scheme it adopts, degrades sharply when applied to the earlier convolutional layers. Moreover, pipelining all three learning stages is limited by the backward-locking problem and requires complicated control logic. In [54], a highly parallel FPGA implementation with a pipelined dataflow is proposed for training. However, that dataflow is designed for a very simple network with only one hidden layer, which is what allows parallelization across stages; the approach does not extend easily to more complicated structures or larger datasets. In contrast, the proposed BPIP dataflow in our work supports training the whole network (including the convolutional layers) at larger scales (e.g., ResNet) with limited accuracy loss. In addition, the thoroughly exploited sparsity is incorporated into the dedicated BPIP design, which ensures low latency with low hardware overhead.
7. Conclusions
8. Limitations and Future Research Directions
- (1) Generalization across markets and regimes. Our evaluation is performed on a specific asset universe and protocol. Additional validation across different market microstructures, asset classes, and extreme-event regimes is needed. Future work will extend benchmarking to broader markets and investigate domain-adaptive pretraining and calibration.
- (2) Sensitivity to sparsity and few-shot settings. SNA depends on design choices such as sparsity thresholds, the split between backbone and adaptive components, and the few-shot window length. A more systematic sensitivity analysis and theoretical understanding under heterogeneous, non-IID client streams remain open. Future studies may explore principled sparsity schedules and automated budget selection.
- (3) Privacy and communication beyond sparse updates. While federated optimization reduces raw data exposure, stronger privacy guarantees (e.g., differential privacy, secure aggregation) and robustness to inference attacks may be required in practice. Future work will quantify privacy–utility–latency trade-offs under realistic networking conditions.
- (4) Full-stack deployment and hardware integration. Our hardware design addresses key bottlenecks induced by irregular sparsity and backward computations, but end-to-end integration (memory hierarchy, host interface, compiler/runtime co-optimization) and portability across edge platforms are not fully explored. Future work will pursue full-stack implementation and broader design-space exploration.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kong, J.; Zhao, X.; He, W.; Yang, X.; Jin, X. EL-MTSA: Stock Prediction Model Based on Ensemble Learning and Multimodal Time Series Analysis. Appl. Sci. 2025, 15, 4669.
- Dželihodžić, A.; Žunić, A.; Žunić Dželihodžić, E. Predictive Modeling of Stock Prices Using Machine Learning: A Comparative Analysis of LSTM, GRU, CNN, and RNN Models. In Proceedings of the International Symposium on Innovative and Interdisciplinary Applications of Advanced Technologies; Springer: Berlin/Heidelberg, Germany, 2024; pp. 447–467.
- Han, H.; Liu, Z.; Barrios Barrios, M.; Li, J.; Zeng, Z.; Sarhan, N.; Awwad, E.M. Time series forecasting model for non-stationary series pattern extraction using deep learning and GARCH modeling. J. Cloud Comput. 2024, 13, 2.
- Guo, Y.; Hu, C.; Yang, Y. Predict the Future from the Past? On the Temporal Data Distribution Shift in Financial Sentiment Classifications. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 1029–1038.
- Wood, K.; Kessler, S.; Roberts, S.J.; Zohren, S. Few-shot learning patterns in financial time-series for trend-following strategies. arXiv 2023, arXiv:2310.10500.
- Lin, J.; Zhu, L.; Chen, W.M.; Wang, W.C.; Gan, C.; Han, S. On-Device Training Under 256KB Memory. In Advances in Neural Information Processing Systems (NeurIPS); MIT Press: Cambridge, MA, USA, 2022; Volume 35, pp. 22941–22954.
- Zhang, Y.; Zhang, Y.; Peng, L.; Quan, L.; Zheng, S.; Lu, Z.; Chen, H. Base-2 Softmax Function: Suitability for Training and Efficient Hardware Implementation. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 3605–3618.
- Du, L.; Ni, L.; Liu, X.; Peng, G.; Li, K.; Mao, W.; Yu, H. A Low-Power DNN Accelerator with Mean-Error-Minimized Approximate Signed Multiplier. IEEE Open J. Circuits Syst. 2024, 5, 57–68.
- Chen, Y.; Zou, J.; Chen, X. April: Accuracy-Improved Floating-Point Approximation For Neural Network Accelerators. In 2025 62nd ACM/IEEE Design Automation Conference (DAC); IEEE: New York, NY, USA, 2025; pp. 1–7.
- Ahmed, M.P.; Tisha, S.A.; Sweet, M.R. Real-Time Hybrid Optimization Models for Edge-Based Financial Risk Assessment: Integrating Deep Learning with Adaptive Regression for Low-Latency Decision Making. J. Bus. Manag. Stud. 2025, 7, 38–52.
- Qin, M.; Sun, S.; Zhang, W.; Xia, H.; Wang, X.; An, B. Earnhft: Efficient hierarchical reinforcement learning for high frequency trading. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 14669–14676.
- Chen, L.; Guo, K.; Fan, G.; Wang, C.; Song, S. Resource constrained profit optimization method for task scheduling in edge cloud. IEEE Access 2020, 8, 118638–118652.
- Kim, S.; Lee, J.; Kang, S.; Lee, J.; Jo, W.; Yoo, H.J. PNPU: An Energy-Efficient Deep-Neural-Network Learning Processor with Stochastic Coarse–Fine Level Weight Pruning and Adaptive Input/Output/Weight Zero Skipping. IEEE Solid-State Circuits Lett. 2021, 4, 22–25.
- Qi, C.; Liu, Y.; Chen, H.; Ge, F.; Liu, W. CIR-NoC: Accelerating CNN Inference Through In-Router Computation During Network Congestion. In 2025 International Symposium of Electronics Design Automation (ISEDA); IEEE: New York, NY, USA, 2025; pp. 29–34.
- Han, D.; Lee, J.; Yoo, H.J. DF-LNPU: A Pipelined Direct Feedback Alignment-Based Deep Neural Network Learning Processor for Fast Online Learning. IEEE J. Solid-State Circuits 2021, 56, 1630–1640.
- Zhu, P.; Li, Y.; Hu, Y.; Xiang, S.; Liu, Q.; Cheng, D.; Liang, Y. MCI-GRU: Stock Prediction Model Based on Multi-Head Cross-Attention and Improved GRU. Neurocomputing 2025, 638, 130168.
- Chen, S.; Ren, S.; Zhang, Q. Hybrid Architectures that Combine LLMs and Predictive Analytics for Next-Generation Financial Modeling. Math. Model. Algorithm Appl. 2025, 6, 31–43.
- Mao, W.; Liu, D.; Zhou, H.; Li, F.; Li, K.; Wu, Q.; Yang, J.; Cheng, Q.; Zhang, L.; Yu, H. A 28-nm 135.19 TOPS/W Bootstrapped-SRAM Compute-in-Memory Accelerator with Layer-Wise Precision and Sparsity. IEEE Trans. Circuits Syst. I Regul. Pap. 2025, 72, 3236–3246.
- Chen, H.; Hao, Y.; Zou, Y.; Chen, X. OA-LAMA: An Outlier-Adaptive LLM Inference Accelerator with Memory-Aligned Mixed-Precision Group Quantization. In 2025 IEEE/ACM International Conference on Computer-Aided Design (ICCAD); IEEE: New York, NY, USA, 2025.
- Zhang, J.; Chen, X.; Song, M.; Li, T. Eager pruning: Algorithm and architecture support for fast training of deep neural networks. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA); IEEE: New York, NY, USA, 2019; pp. 292–303.
- Lee, J.; Lee, J.; Han, D.; Lee, J.; Park, G.; Yoo, H.J. 7.7 LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. In 2019 IEEE International Solid-State Circuits Conference (ISSCC); IEEE: New York, NY, USA, 2019; pp. 142–144.
- Kang, S.; Han, D.; Lee, J.; Im, D.; Kim, S.; Kim, S.Y.; Ryu, J.; Yoo, H. GANPU: An Energy-Efficient Multi-DNN Training Processor for GANs with Speculative Dual-Sparsity Exploitation. IEEE J. Solid-State Circuits 2021, 56, 2845–2857.
- Yang, D.; Ghasemazar, A.; Ren, X.; Golub, M.; Lemieux, G.; Lis, M. Procrustes: A dataflow and accelerator for sparse deep neural network training. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO); IEEE: New York, NY, USA, 2020; pp. 711–724.
- Tang, Y.; Zhang, X.; Zhou, P.; Hu, J. EF-train: Enable efficient on-device CNN training on FPGA through data reshaping for online adaptation or personalization. In ACM Transactions on Design Automation of Electronic Systems (TODAES); Association for Computing Machinery: New York, NY, USA, 2022; Volume 27, pp. 1–36.
- Kang, D. (강두석). Hardware-Aware Software Optimization Techniques for Convolutional Neural Networks on Embedded Systems. Ph.D. Thesis, Seoul National University Graduate School, Seoul, Republic of Korea, 2021.
- Paissan, F.; Nadalini, D.; Rusci, M.; Ancilotto, A.; Conti, F.; Benini, L.; Farella, E. Structured Sparse Back-propagation for Lightweight On-Device Continual Learning on Microcontroller Units. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: New York, NY, USA, 2024; pp. 2172–2181.
- Zhao, Y.; Li, H.; Young, I.; Zhang, Z. Poor Man’s Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach. arXiv 2024, arXiv:2411.05873.
- Deutel, M.; Hannig, F.; Mutschler, C.; Teich, J. On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers. arXiv 2024, arXiv:2407.10734.
- Nakahara, H.; Sada, Y.; Shimoda, M.; Sayama, K.; Jinguji, A.; Sato, S. FPGA-based training accelerator utilizing sparseness of convolutional neural network. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL); IEEE: New York, NY, USA, 2019; pp. 180–186.
- Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 9062–9071.
- Yue, Z.; Zhang, H.; Sun, Q.; Hua, X.S. Interventional few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS); MIT Press: Cambridge, MA, USA, 2020; Volume 33, pp. 2734–2746.
- Li, T.; Liu, Z.; Shen, Y.; Wang, X.; Chen, H.; Huang, S. Master: Market-guided stock transformer for stock price forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 162–170.
- Mazza, L. Coarse-Graining the Cross-Section: How Regression-via-Classification Improves Robustness in High-Noise, Small-Sample-Size Domains such as Cross-Sectional Asset Pricing. Master’s Thesis, KTH, School of Electrical Engineering and Computer Science, Stockholm, Sweden, 2024.
- Fischer, T.; Krauss, C. Deep learning with long short-term memory networks for financial market predictions. Eur. J. Oper. Res. 2018, 270, 654–669.
- Jiang, J.; Yang, C.; Wang, X.; Li, B. Why Regression? Binary Encoding Classification Brings Confidence to Stock Market Index Price Prediction. arXiv 2025, arXiv:2506.03153.
- Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
- Evci, U.; Gale, T.; Menick, J.; Castro, P.S.; Elsen, E. Rigging the lottery: Making all tickets winners. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 2943–2952.
- Mrabah, N.; Richet, N.; Ben Ayed, I.; Granger, E. Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2025; pp. 3143–3152.
- Yang, X.; Liu, W.; Zhou, D.; Bian, J.; Liu, T.Y. Qlib: An AI-oriented Quantitative Investment Platform. arXiv 2020, arXiv:2009.11189.
- Kakushadze, Z. 101 Formulaic Alphas. Wilmott Mag. 2016, 84, 72–80.
- Novy-Marx, R. The Other Side of Value: The Gross Profitability Premium. J. Financ. Econ. 2013, 108, 1–28.
- Asness, C.S.; Frazzini, A.; Pedersen, L.H. Quality Minus Junk. Rev. Account. Stud. 2019, 24, 34–112.
- Maingo, I.; Ravele, T.; Sigauke, C. A Fusion of Statistical and Machine Learning Methods: GARCH-XGBoost for Improved Volatility Modelling of the JSE Top40 Index. Int. J. Financ. Stud. 2025, 13, 155.
- Madhulatha, T.S.; Ghori, M.A.S. Deep neural network approach integrated with reinforcement learning for forecasting exchange rates using time series data and influential factors. Sci. Rep. 2025, 15, 29009.
- Bieganowski, B.; Ślepaczuk, R. Supervised autoencoder MLP for financial time series forecasting. J. Big Data 2025, 12, 207.
- Cheng, L.; Cheng, X.; Liu, S. Fast Learning in Quantitative Finance with Extreme Learning Machine. arXiv 2025, arXiv:2505.09551.
- Wang, M.; Lu, S.; Zhu, D.; Lin, J.; Wang, Z. A high-speed and low-complexity architecture for softmax function in deep learning. In 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS); IEEE: New York, NY, USA, 2018; pp. 223–226.
- Choi, S.; Sim, J.; Kang, M.; Choi, Y.; Kim, H.; Kim, L.S. A 47.4 μJ/epoch Trainable Deep Convolutional Neural Network Accelerator for In-Situ Personalization on Smart Devices. In 2019 IEEE Asian Solid-State Circuits Conference (A-SSCC); IEEE: New York, NY, USA, 2019; pp. 57–60.
- Zhao, Y.; Li, C.; Wang, Y.; Xu, P.; Zhang, Y.; Lin, Y. DNN-chip predictor: An analytical performance predictor for DNN accelerators with various dataflows and hardware architectures. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2020; pp. 1593–1597.
- Lu, W.; Pei, H.H.; Yu, J.R.; Chen, H.M.; Huang, P.T. A 28nm Energy-Area-Efficient Row-based pipelined Training Accelerator with Mixed FXP4/FP16 for On-Device Transfer Learning. In 2024 IEEE International Symposium on Circuits and Systems (ISCAS); IEEE: New York, NY, USA, 2024; pp. 1–5.
- Wang, Y.; Deng, D.; Liu, L.; Wei, S.; Yin, S. PL-NPU: An Energy-Efficient Edge-Device DNN Training Processor with Posit-Based Logarithm-Domain Computing. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 4042–4055.
- Venkataramanaiah, S.K.; Meng, J.; Suh, H.S.; Yeo, I.; Saikia, J.; Cherupally, S.K.; Zhang, Y.; Zhang, Z.; Seo, J.S. A 28-nm 8-bit Floating-Point Tensor Core-Based Programmable CNN Training Processor with Dynamic Structured Sparsity. IEEE J. Solid-State Circuits 2023, 58, 1885–1897.
- Qian, J.; Ge, H.; Lu, Y.; Shan, W. A 4.69-TOPS/W Training, 2.34-μJ/Image Inference On-Chip Training Accelerator with Inference-Compatible Backpropagation and Design Space Exploration in 28-nm CMOS. IEEE J. Solid-State Circuits 2025, 60, 298–307.
- Dey, S.; Chen, D.; Li, Z.; Kundu, S.; Huang, K.W.; Chugg, K.M.; Beerel, P.A. A highly parallel FPGA implementation of sparse neural network training. In 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig); IEEE: New York, NY, USA, 2018; pp. 1–4.
| Method | Bits (Factor of d) | Speedup vs. Dense | Efficiency vs. SNA |
|---|---|---|---|
| Dense (FedAvg) | | 1.00× | 10.0% |
| Top-k + Raw Indices | | ∼2.00× | 20.0% |
| Top-k + Block Encoding | | ∼4.44× | 44.4% |
| Top-k + Delta Encoding | | ∼6.67× | 66.7% |
| Top-k + Entropy Coding (Limit) | ∼1.132 d | ∼7.07× | 70.7% |
| SNA (Ours) | | 10.0× | 100.0% |
| Metric | Full-FT (Dense) | Top-k (Dynamic) | SNA (Ours) |
|---|---|---|---|
| Mask Selection | N/A | Per-round Gradient | Meta-sensitivity |
| Regularization | None | Weak | Structural Anchoring |
| Index Transmission | None | Required (support signaling; coding-dependent) | None (Index-free) |
| Adaptation Stability | Poor (Overfitting) | Moderate | High (Robust) |
| Communication Efficiency | 1.0× | Up to ∼7.1× | 10.0× |
| Model | Core Principle | Best IC | Critical Flaw in High-Resolution Intraday Trading Scenario |
|---|---|---|---|
| SNA (Ours) | Federated Meta-Learning + Sparse Adaptation | 0.1176 | N/A (Achieves Optimal Privacy-Efficiency Balance) |
| GARCH-XGBoost | Ensemble of Econometrics & Gradient Boosting | 0.1063 | Inductive Bias Mismatch: Statistical splitting requires dense data; refitting on 5-day samples leads to severe overfitting. |
| DQN | Reinforcement Learning (Q-Learning) | 0.1036 | Instability: Diverges easily in noisy, few-shot environments. |
| MCI-GRU | Cross-Attention + Gated RNN | 0.0767 | Data Hunger: Suffers from catastrophic overfitting on small 5-day datasets. |
| SA-MLP | Supervised Representation Learning | 0.0766 | Over-parameterization: Lacks structural regularization for sparse data. |
| ELM | Randomized Weight Learning | 0.0598 | Under-fitting: Too simple to capture complex non-linear market factors. |
| Metric | This Work | Tesla V100 | ISCAS24 [50] | TCASI22 [51] | JSSC23 [52] | JSSC25 [53] |
|---|---|---|---|---|---|---|
| Sparsity Support | Yes | No | No | No | Yes | Yes |
| Supply Voltage (V) | 0.7–1.1 | - | - | - | 0.6–1.1 | 0.43–0.9 |
| Area (mm²) | 7.58 | 815 | 1.84 | 5.28 | 16.4 | 2 |
| Bit Precision | FXP8 | FP16/32 | FP4/8 + FP16/32 | Posit8 | FP8/FP16 | FXP8 |
| Max Freq. (MHz) | 200 | 1455 | 160 | 1040 | 340 | 200 |
| Peak Perf. (TOPS) | 0.10–1.14 | 120 (FP16) | 0.157 | 0.532 | 1.24–3.76 | 0.0384 |
| Power (mW) | 44.6 | 300,000 | 67.4 | 11–343 | 51.1–623.7 | 0.836–18 |
| Efficiency (TOPS/W) | 2.2–45.78 | ∼0.4 | 2.19 | 1.21–4.51 | 5.3–11.7 | 2.13–4.69 |