1. Introduction
Stock markets play a central role in modern financial systems, serving as barometers of macroeconomic conditions, channels for capital allocation, and platforms for investment and risk management. As global market capitalization has expanded from a few trillion USD in the late twentieth century to over one hundred trillion USD today, the scale and complexity of equity markets have increased substantially [1]. Accurate stock price forecasting has therefore become essential not only for portfolio optimization and algorithmic trading [2] but also for broader objectives such as financial stability assessment and policy analysis. Despite extensive research, achieving robust and high-precision forecasts remains challenging due to the intricate and evolving nature of financial time series.
Stock prices exhibit pronounced non-stationarity, nonlinear dynamics, and sensitivity to both endogenous drivers such as liquidity and order flow and exogenous influences such as macroeconomic announcements and geopolitical events [3]. Traditional statistical models, including autoregressive (AR), autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA) [4], rely on linearity and stationarity assumptions that rarely hold in real markets, often resulting in biased or unstable predictions. Classical machine learning methods such as support vector machines, decision trees, and nearest-neighbor algorithms [5,6,7,8] offer greater flexibility but still struggle to capture long-range dependencies, nonlinear regime transitions, and complex cross-feature interactions [9,10].
Deep learning has significantly advanced time-series forecasting by enabling nonlinear representation learning and hierarchical feature extraction [11,12,13,14,15,16,17]. Long short-term memory (LSTM)-based models effectively capture short- and mid-term temporal dependencies and have been widely applied to financial prediction tasks [18,19,20,21]. Transformer architectures, originally developed for natural language processing, leverage self-attention to model long-range contextual relationships [22,23]. Hybrid models combining convolution, recurrence, and attention mechanisms have also been explored [24,25,26]. However, two fundamental challenges persist.
First, financial time series evolve across multiple temporal scales. Short-term fluctuations reflect microstructure dynamics, mid-term patterns capture trend formation, and long-term behaviors arise from structural and macroeconomic forces. Single-family architectures often fail to capture this full spectrum of temporal dependencies, particularly under data scarcity or regime shifts.
Second, hybrid deep models introduce a large number of interacting hyperparameters, including hidden dimensions, layer depths, attention configurations, expansion ratios, dropout rates, and learning rates. Model performance is highly sensitive to these design choices, and manual tuning is inefficient, heuristic, and prone to suboptimal solutions [27].
Recent advances in large language models have demonstrated the effectiveness of attention-based architectures in modeling long-range dependencies and complex sequence structures [28]. These models show that sequence elements can be treated as contextualized tokens whose representations evolve through both local interactions and global reasoning. This perspective naturally extends to financial time series, where each time step can be viewed as a financial token shaped by both immediate market movements and broader contextual influences. However, financial sequences differ from textual data in their higher noise levels, stronger non-stationarity, and greater sensitivity to hyperparameter configurations. These characteristics motivate a purpose-driven architecture that integrates recurrent and attention-based mechanisms for multi-scale temporal reasoning and employs principled optimization to configure the model in a data-driven manner.
To address these challenges, we propose FATE-Net, a Financial Attention-Driven Temporal Evolution Network that unifies LSTM-based local temporal modeling, Transformer-based global contextual refinement, and multi-objective evolutionary hyperparameter optimization. The framework draws inspiration from attention-driven sequence modeling in large language models but adapts these ideas to the unique characteristics of financial data. The LSTM module captures short- and mid-term temporal evolution, while the attention mechanism enables long-range reasoning across the entire sequence. A Multi-Objective Particle Swarm Optimization strategy is incorporated to automatically search the hyperparameter space, jointly minimizing prediction error and enhancing model robustness.
The main contributions of this work are summarized as follows.
We propose FATE-Net, a hybrid deep learning framework that integrates LSTM-based local temporal modeling with Transformer-based global contextual refinement to capture multi-scale dependencies in financial time series.
We formulate the configuration of FATE-Net as a multi-objective optimization problem and employ Multi-Objective Particle Swarm Optimization (MOPSO) to automatically identify hyperparameter settings that balance relative and absolute prediction accuracy.
This study aims to solve the dual challenges of hyperparameter sensitivity and performance trade-offs by: (1) formulating the FATE-Net configuration as an automated multi-objective optimization problem to eliminate manual tuning bias; (2) employing MOPSO to identify settings that simultaneously balance relative and absolute prediction accuracy; and (3) ensuring architectural robustness through a unified, data-driven search across the complex design space.
Extensive experiments demonstrate that FATE-Net outperforms a wide range of baseline models, including Support Vector Regression (SVR), Gradient Boosting Machine (GBM), Recurrent Neural Network (RNN), LSTM, Transformer, and LSTM–Transformer.
Overall, this work bridges ideas from attention-based sequence modeling, recurrent neural architectures, and multi-objective evolutionary optimization. The resulting framework demonstrates that integrating multi-scale temporal reasoning with principled hyperparameter search yields a powerful and robust solution for financial time-series forecasting.
2. Related Works
Stock price prediction spans traditional statistical models, classical machine learning methods, deep learning architectures, and hybrid optimization-enhanced frameworks. This section reviews the most relevant literature from the perspective of the core challenges identified in financial time-series forecasting (nonlinearity, multi-scale temporal dependencies, and hyperparameter sensitivity) and highlights the limitations that motivate the development of FATE-Net.
2.1. Traditional and Machine Learning-Based Approaches
Early stock prediction models relied on linear statistical methods such as AR, ARMA, and ARIMA [29]. Although effective for stationary signals, these models assume linearity and weak temporal dependence, making them unsuitable for highly volatile and nonlinear financial markets. Machine learning methods such as k-Nearest Neighbors (kNNs) [5], Support Vector Machine (SVM) [6], decision trees, and ensemble models (e.g., GBM, Random Forests) improve flexibility but still struggle to capture long-range dependencies and regime shifts [30]. These approaches treat time steps largely independently and lack mechanisms for modeling temporal evolution, limiting their predictive robustness.
However, traditional and ML-based models cannot capture hierarchical temporal patterns or nonlinear interactions, and they lack the sequence modeling capabilities required for financial forecasting.
2.2. Deep Learning for Financial Time Series
Deep learning has become the dominant paradigm for stock prediction due to its ability to model nonlinear and high-dimensional temporal patterns [31,32]. LSTM-based models [18,19,33] effectively capture short- and mid-term dependencies through gated memory mechanisms. Variants such as Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network (CNN)–LSTM hybrids, and decomposition-enhanced LSTM frameworks [24,25,34,35,36,37] further improve feature extraction. However, LSTM architectures inherently struggle with long-range dependencies due to their sequential nature and gradient decay.
Transformer-based models have recently been introduced into financial forecasting [22,23,38]. Self-attention enables global contextual modeling, and the success of large language models (LLMs) demonstrates the power of attention-driven sequence reasoning [39]. Several studies apply Transformer variants to stock prediction, often combined with decomposition, frequency-domain analysis, or attention enhancements.
However, pure LSTM models fail to capture long-range dependencies, while Transformer models often require large datasets, are sensitive to noise, and may overfit financial sequences.
Existing hybrid models typically combine components heuristically without a principled mechanism to balance local and global temporal modeling.
2.3. Hybrid Architectures and Optimization-Based Models
Hybrid models combining CNN, RNN, and attention mechanisms have been proposed to capture multi-scale temporal features [24,40]. Other works incorporate clustering, decomposition, or graph structures to enhance representation learning [25,41,42,43]. However, these architectures introduce numerous hyperparameters whose interactions are complex and difficult to tune manually.
To address hyperparameter sensitivity, evolutionary algorithms such as genetic algorithms, differential evolution, and Particle Swarm Optimization (PSO) have been applied to optimize neural networks [27,44,45]. Multi-objective optimization methods, including MOPSO, have been used in domains such as carbon emission prediction and drug discovery, demonstrating strong global search capabilities [46,47].
However, existing hybrid models rarely integrate optimization into the architectural design itself. Most works treat hyperparameter tuning as a separate step, rely on single-objective optimization, or optimize only a subset of parameters, and thus do not jointly address the dual challenges of multi-scale temporal modeling and hyperparameter sensitivity.
2.4. Gap Analysis and Motivation for FATE-Net
Based on the above literature, three major gaps remain:
1. Existing models either focus on local temporal continuity (LSTM) or global context (Transformer), but few provide a principled integration of both perspectives.
2. While Transformers have been applied to financial data, most works do not fully leverage the analogy between financial time steps and “tokens” in LLMs, nor do they adopt the iterative refinement philosophy central to modern attention-based architectures.
3. Current hybrid models rely heavily on manual tuning or single-objective optimization, which cannot adequately balance accuracy, stability, and generalization.
2.5. Positioning of FATE-Net
FATE-Net is designed to address these gaps through:
1. A purpose-driven hybrid architecture that integrates LSTM-based local temporal evolution with Transformer-based global contextual refinement, inspired by the sequence reasoning capabilities of LLMs.
2. A multi-objective evolutionary optimization strategy that automatically configures the architecture and training hyperparameters, improving robustness and reducing reliance on expert heuristics.
3. A unified optimization-enhanced forecasting framework that tightly couples temporal modeling and hyperparameter search, enabling FATE-Net to achieve superior predictive performance on real-world financial data.
In summary, while prior works have made significant progress in financial time-series forecasting, they fall short in addressing the combined challenges of multi-scale temporal modeling, attention-driven contextual reasoning, and automated hyperparameter optimization. FATE-Net fills this gap by integrating LLM-inspired attention mechanisms with evolutionary optimization into a coherent and principled forecasting framework.
3. Methodology
FATE-Net is developed as a purpose-driven forecasting framework that unifies multi-scale temporal modeling with evolutionary hyperparameter optimization. Rather than adopting a conventional hybrid design that merely stacks LSTM and Transformer layers, FATE-Net is explicitly constructed to address the fundamental challenges of financial time-series prediction, including nonlinear temporal evolution, long-range contextual dependencies, and sensitivity to architectural and training hyperparameters. Drawing inspiration from the sequence reasoning capabilities of large language models, the framework interprets each time step as a “financial token” whose meaning evolves through both local sequential patterns and global contextual interactions. This perspective motivates the integration of LSTM-based local temporal encoding with attention-driven global refinement, enabling the model to capture the full spectrum of temporal dependencies present in real-world financial data. The following sections describe the architectural components and the multi-objective optimization strategy that jointly form the FATE-Net forecasting pipeline.
3.1. Temporal Evolution Modeling in FATE-Net
Financial time series are shaped by complex temporal dynamics that unfold across multiple scales. Short-term fluctuations arise from rapid microstructure changes, mid-term patterns reflect trend formation and market adjustment, and long-range dependencies emerge from structural shifts driven by macroeconomic or policy factors. A unified temporal modeling framework must therefore capture both local continuity and global contextual interactions to accurately represent the evolution of stock prices.
Recurrent architectures such as LSTM are effective at modeling short- and mid-term sequential dependencies, as their gated structure preserves local temporal coherence and mitigates gradient decay. However, their ability to capture long-range interactions is inherently limited by their sequential processing nature. In contrast, Transformer-based architectures excel at modeling global dependencies through self-attention, yet they may become unstable or prone to overfitting when applied directly to noisy financial data without prior temporal structuring.
Recent advances in large language models demonstrate that attention mechanisms can refine token representations by enabling each token to contextualize itself within the entire sequence. Drawing inspiration from this paradigm, each time step in a financial series can be interpreted as a “financial token” whose meaning is shaped not only by its immediate neighbors but also by broader market conditions. This perspective motivates a two-stage temporal modeling strategy in FATE-Net: the LSTM module first establishes a coherent local temporal representation, and the subsequent attention mechanism enriches this representation by incorporating global contextual information.
The resulting multi-scale temporal evolution module integrates recurrent reasoning with attention-driven refinement. By combining the strengths of both architectures, FATE-Net is able to model nonlinear, volatile, and regime-sensitive financial dynamics more effectively than single-family approaches. This design forms the foundation for the model’s ability to capture the full spectrum of temporal dependencies present in real-world stock price movements.
3.1.1. Stage I: LSTM-Based Local Temporal Evolution
The first stage of FATE-Net focuses on modeling short- and mid-term temporal dependencies, which are essential for capturing local continuity and rapid fluctuations in financial time series. Given an input sequence $X = (x_1, x_2, \ldots, x_T)$, where $x_t \in \mathbb{R}^{d}$ denotes the feature vector at time $t$, the LSTM cell transforms the sequence into a set of hidden representations $\{h_1, \ldots, h_T\}$ that encode local temporal evolution.
At each time step, the LSTM computes three gating vectors and a candidate cell update:

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, $\quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$, $\quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, $\quad \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$.

The forget gate $f_t$ determines how much past information should be retained, the input gate $i_t$ controls the incorporation of new information, and the output gate $o_t$ regulates how much of the updated memory is exposed to the next layer. The candidate vector $\tilde{c}_t$ represents the new information proposed for inclusion in the cell state.

The internal memory state and hidden representation are then updated as:

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, $\quad h_t = o_t \odot \tanh(c_t)$.
These equations show how the LSTM balances historical memory with newly observed patterns. Overall, the non-linear modeling capability in Stage I is primarily driven by the gating mechanisms and activation functions. Specifically, the sigmoid functions ($\sigma$) in the forget, input, and output gates act as non-linear filters that selectively retain or discard information based on current market volatility. Simultaneously, the $\tanh$ function in the candidate cell update maps the input and previous hidden states into a non-linear space, enabling the model to capture the rapid fluctuations and non-stationary dynamics inherent in short-term stock price movements.
To highlight the temporal propagation mechanism, the cell state can be expanded (taking $c_0 = 0$) as:

$c_t = \sum_{k=1}^{t} \Big( \prod_{j=k+1}^{t} f_j \Big) \odot \big( i_k \odot \tilde{c}_k \big),$

which makes explicit how earlier information is progressively attenuated through the product of forget gates. This formulation reveals the LSTM’s ability to preserve relevant short-term patterns while gradually discounting older information, a property particularly important for modeling the rapidly evolving dynamics of stock prices.
The final output of Stage I is the sequence of hidden states:

$H = [h_1, h_2, \ldots, h_T] \in \mathbb{R}^{T \times d_h},$

which serves as the locally coherent temporal representation to be further refined by the attention-based global contextual module in Stage II.
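To make the gate equations concrete, the following is a minimal NumPy sketch of a single LSTM cell step. This is an illustrative re-implementation of the standard recurrence above, not the paper's PyTorch code; the weight layout (four stacked gate blocks) and sizes are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_prev, x_t] to the four
    stacked gate pre-activations (forget, input, output, candidate)."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # (4d,) pre-activations
    f = sigmoid(z[0:d])          # forget gate: how much memory to keep
    i = sigmoid(z[d:2 * d])      # input gate: how much new info to admit
    o = sigmoid(z[2 * d:3 * d])  # output gate: how much memory to expose
    c_tilde = np.tanh(z[3 * d:]) # candidate cell update
    c_t = f * c_prev + i * c_tilde   # c_t = f ⊙ c_{t-1} + i ⊙ c̃_t
    h_t = o * np.tanh(c_t)           # h_t = o ⊙ tanh(c_t)
    return h_t, c_t

# Toy example: hidden size 4, input size 3, unrolled over 5 time steps.
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)  # (4,)
```

Because $h_t = o_t \odot \tanh(c_t)$ with $o_t \in (0,1)$, every hidden component stays strictly inside $(-1, 1)$, which illustrates the bounded, gated nature of the local representation.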
3.1.2. Stage II: Attention-Based Global Contextual Refinement
Although the LSTM module provides a coherent representation of short- and mid-term temporal dynamics, financial markets are also influenced by interactions that unfold over much longer horizons. Macroeconomic cycles, policy adjustments, and structural regime shifts often exert delayed or cumulative effects on price movements, making it necessary to incorporate global contextual information into the temporal representation. To address this requirement, FATE-Net adopts an attention-based refinement mechanism inspired by the sequence reasoning capabilities of large language models. This mechanism enables each time step to selectively attend to all others in the sequence, allowing the model to integrate long-range dependencies that are not captured by recurrent processing alone.
The refinement process begins by projecting the LSTM output matrix $H$ into three learned representation spaces:

$Q = H W_Q, \quad K = H W_K, \quad V = H W_V.$

The matrices $Q$, $K$, and $V$ represent queries, keys, and values, respectively. This transformation allows the model to compute relevance scores between all pairs of time steps, forming the foundation for global contextual reasoning.
The scaled dot-product attention mechanism is then applied:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V.$
The scaled product inside the softmax quantifies the similarity between each query–key pair, while the normalization ensures that the resulting weights form a probability distribution. This operation enables the model to emphasize time steps that carry stronger contextual relevance for the current prediction.
To illustrate the nonlinear refinement performed by the attention mechanism, the softmax operator can be approximated using a truncated series expansion:

$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \approx \frac{1 + z_i + z_i^{2}/2}{\sum_{j} \big( 1 + z_j + z_j^{2}/2 \big)},$
which highlights how similarity scores are transformed into normalized contextual weights through a nonlinear mapping. Furthermore, the attention mechanism performs global non-linear refinement by utilizing the Softmax operator. As shown in the truncated expansion, the Softmax mapping applies a non-linear transformation to the query–key similarity scores, effectively amplifying critical market signals while suppressing noise. This process allows FATE-Net to discern complex, long-range contextual relationships that linear models fail to identify. The subsequent Feed-Forward Network (FFN) and Layer Normalization further enrich this capacity by executing high-dimensional non-linear projections, stabilizing the representation of regime-sensitive financial dynamics.
To capture diverse patterns of temporal interaction, multi-head attention is employed:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}).$
Each attention head learns a distinct subspace of temporal relationships, enabling the model to jointly capture trend-level dependencies, cyclical patterns, and event-driven influences. This multi-head structure mirrors the design principles of LLMs and significantly enhances the model’s ability to reason over long sequences.
The Transformer encoder layer further refines the representation through residual connections and feed-forward transformations:

$Z = \mathrm{LayerNorm}\big(H + \mathrm{MultiHead}(H, H, H)\big), \quad H' = \mathrm{LayerNorm}\big(Z + \mathrm{FFN}(Z)\big),$
which stabilizes optimization and enriches the expressive capacity of the model by combining global contextual information with the original LSTM features.
The final temporal representation used for prediction is extracted from the last time step:

$z_T = H'_{T},$
which now encodes both local sequential dynamics and long-range contextual dependencies.
The predicted closing price is then computed as:

$\hat{y}_{T+1} = W_p\, z_T + b_p.$
Through this attention-based refinement stage, FATE-Net extends beyond purely local modeling and acquires the ability to perform long-range temporal reasoning.
In summary, FATE-Net captures non-linearity through a hierarchical approach: Stage I leverages recurrent gates to model dynamic local non-linearity, while Stage II employs multi-head attention and feed-forward transformations to reconstruct the global feature space non-linearly. This synergy ensures that the framework can effectively represent the full spectrum of temporal dependencies in volatile financial time series.
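The global refinement step can be sketched with a single-head NumPy implementation of the scaled dot-product attention described above. Shapes, weight initializations, and the random data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(H, W_q, W_k, W_v):
    """Project the LSTM output H (T x d) to queries, keys, and values,
    then let every time step attend to all others in the sequence."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance (T x T)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # contextually refined tokens

# Toy sequence: T = 6 "financial tokens" with feature size d = 8.
rng = np.random.default_rng(1)
T, d, d_k = 6, 8, 4
H = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(H, W_q, W_k, W_v)
print(out.shape)  # (6, 4)
```

Each output row is a convex combination of the value vectors of all time steps, which is precisely the long-range contextual mixing that the recurrent stage alone cannot perform.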
3.2. Multi-Objective Evolutionary Hyperparameter Optimization
The hybrid architecture of FATE-Net introduces a large number of interacting hyperparameters, including recurrent dimensions, attention configurations, and optimization settings. These hyperparameters jointly determine the model’s expressive capacity and training stability, yet their interactions are highly nonlinear and difficult to tune manually. To enable a fully data-driven configuration process, FATE-Net incorporates a multi-objective evolutionary optimization strategy based on MOPSO. This approach is inspired by the automated architecture search paradigm widely adopted in large language models, where evolutionary or gradient-free methods are used to explore high-dimensional design spaces efficiently. In FATE-Net, MOPSO is employed to simultaneously minimize two complementary objectives, MAPE and RMSE, ensuring that the resulting configuration achieves both relative and absolute accuracy.
The selection of MOPSO as the core optimization engine is driven by the specific theoretical requirements of the hybrid architecture. Conventional strategies, such as Grid Search and Random Search, suffer from the curse of dimensionality and are computationally intractable given the extensive and highly interacting parameter space of the LSTM–Transformer components. While Bayesian Optimization (BO) offers sample efficiency, standard BO frameworks are inherently single-objective. The financial forecasting task explicitly requires balancing competing metrics, such as minimizing relative deviations while heavily penalizing large absolute errors. MOPSO natively supports this multi-objective paradigm by maintaining a diverse set of non-dominated solutions along a Pareto front. To prevent meta-overfitting and ensure robust generalization, the internal parameters of the MOPSO algorithm (e.g., inertia weight $\omega$, cognitive coefficient $c_1$, and social coefficient $c_2$) were strictly set to standard empirical values established in evolutionary computation literature, thereby focusing the optimization entirely on the temporal network’s configuration.
3.2.1. Particle Encoding
Each particle in the swarm represents a candidate hyperparameter configuration. The particle vector is defined as:

$\mathbf{p} = \big[\, d_{\mathrm{LSTM}},\; L_{\mathrm{Trans}},\; d_{\mathrm{attn}},\; r_{\mathrm{FFN}},\; p_{\mathrm{drop}},\; \eta \,\big].$
This encoding allows the optimizer to jointly adjust the LSTM capacity, Transformer depth, attention dimensionality, feed-forward expansion ratio, regularization strength, and learning rate. By exploring this unified representation, MOPSO can discover configurations that balance model complexity with generalization performance.
The optimization process evaluates each particle using two objective functions:

$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \quad \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \big( y_i - \hat{y}_i \big)^{2} }.$

MAPE emphasizes relative prediction accuracy, while RMSE penalizes large absolute deviations. Optimizing both objectives ensures that the resulting model performs well across different error scales.
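The two fitness functions are straightforward to implement; the toy prices below are purely illustrative:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (relative accuracy), in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred):
    """Root mean squared error (penalizes large absolute deviations)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_actual    = [100.0, 102.0, 98.0, 101.0]
y_predicted = [101.0, 100.0, 99.0, 103.0]
print(mape(y_actual, y_predicted), rmse(y_actual, y_predicted))
```

A configuration can lower RMSE (by fitting high-priced regimes well) while worsening MAPE (by missing low-priced ones proportionally), which is why the two objectives are treated as genuinely competing rather than collapsed into one score.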
3.2.2. Evolutionary Update
During each iteration, particles update their velocity and position based on three components: inertia, personal experience, and global guidance. The update rules are:

$v_i^{t+1} = \omega v_i^{t} + c_1 r_1 \big( p_i^{\mathrm{best}} - x_i^{t} \big) + c_2 r_2 \big( g^{\mathrm{best}} - x_i^{t} \big), \quad x_i^{t+1} = x_i^{t} + v_i^{t+1}.$

The inertia term $\omega v_i^{t}$ preserves momentum from previous updates, the cognitive term encourages each particle to return to its historically best configuration, and the social term guides the swarm toward the best solution found by the population. Together, these components enable efficient exploration of the hyperparameter space while avoiding premature convergence.
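The inertia–cognitive–social update can be sketched vectorially as follows; the coefficient values and swarm size here are illustrative defaults, not the paper's exact settings:

```python
import numpy as np

def pso_update(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One velocity/position update for a swarm of hyperparameter vectors.
    x, v, p_best have shape (n_particles, dim); g_best has shape (dim,)."""
    rng = rng or np.random.default_rng()
    r1 = rng.random(x.shape)  # stochastic cognitive weighting
    r2 = rng.random(x.shape)  # stochastic social weighting
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    return x + v_new, v_new

# Toy swarm: 8 particles, each encoding 6 normalized hyperparameters.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=(8, 6))
v = np.zeros_like(x)
p_best = x.copy()     # at initialization, each particle is its own best
g_best = x[0]         # assume particle 0 currently leads the swarm
x_next, v_next = pso_update(x, v, p_best, g_best, rng=rng)
print(x_next.shape)   # (8, 6)
```

Note that at initialization the cognitive term vanishes (each particle is its own personal best), so the leader itself does not move on the first step while the rest of the swarm drifts toward it.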
Specifically, the optimization process is governed by dual stopping criteria: (1) a maximum iteration count of 100 to ensure computational efficiency and (2) a stability threshold where the process terminates if the non-dominated solutions in the external archive remain unchanged for 10 consecutive generations.
It should be noted that the integration of non-dominated solutions into an external archive is essential because financial time series are characterized by non-stationarity and noise. A solution that minimizes absolute error may not necessarily achieve the best relative accuracy. By maintaining a set of non-dominated solutions, FATE-Net avoids the bias of single-objective optimization and provides a robust pool of candidate architectures that perform consistently across different error scales.
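Archive maintenance rests on a Pareto-dominance test: a configuration enters the archive only if no other configuration is at least as good on both objectives and strictly better on one. A minimal sketch for minimizing (MAPE, RMSE) pairs, with hypothetical objective values:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b under minimization:
    a is no worse on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(points):
    """Filter a list of (MAPE, RMSE) pairs down to the Pareto front."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(1.2, 3.0), (1.0, 3.5), (1.5, 2.8), (1.1, 3.6)]
front = non_dominated(candidates)
print(front)  # (1.1, 3.6) is dominated by (1.0, 3.5) and drops out
```

The surviving front contains mutually incomparable trade-offs, exactly the "robust pool of candidate architectures" the text describes.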
To maintain diversity along the Pareto front, MOPSO employs a crowding-distance mechanism:

$CD_i = \sum_{m=1}^{M} \frac{f_m^{(i+1)} - f_m^{(i-1)}}{f_m^{\max} - f_m^{\min}}.$

To compute the crowding distance $CD_i$, all non-dominated solutions in the archive are first sorted in ascending order according to each objective function $f_m$. The distance is then calculated as the sum of the normalized differences between the objective values of the two adjacent solutions ($i-1$ and $i+1$).
Specifically, this metric quantifies how isolated a particle is within the objective space. Particles with larger crowding distances are preferred during selection, ensuring that the optimizer explores a broad range of trade-offs between MAPE and RMSE rather than collapsing to a single narrow region. Accordingly, the final optimal solution is selected from the external archive based on the maximum crowding distance. This selection strategy is prioritized to enhance the diversity of the swarm’s search space, preventing the model from over-fitting to a specific localized error pattern. This mechanism ensures that the chosen configuration of LSTM and Transformer components is both empirically optimized and theoretically stable under volatile market conditions.
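A compact implementation of the crowding-distance computation, following the standard convention of assigning infinite distance to the boundary solutions of each objective (the front values below are hypothetical):

```python
import math

def crowding_distance(front):
    """Crowding distance for a list of objective vectors, e.g. (MAPE, RMSE).
    Boundary solutions get infinite distance; interior solutions accumulate
    the normalized gap between their two neighbors along each objective."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])  # sort by objective k
        f_min, f_max = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = math.inf  # keep extreme trade-offs
        if f_max == f_min:
            continue  # degenerate objective: no spread to normalize by
        for j in range(1, n - 1):
            gap = front[order[j + 1]][k] - front[order[j - 1]][k]
            dist[order[j]] += gap / (f_max - f_min)
    return dist

pareto_front = [(1.0, 3.5), (1.2, 3.0), (1.5, 2.8)]
d = crowding_distance(pareto_front)
print(d)  # boundary solutions are inf; the middle one is finite
```

Selecting the archive member with the largest finite crowding distance (or a boundary solution) therefore favors sparsely populated regions of the trade-off surface, matching the diversity-driven selection strategy described above.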
Through this evolutionary process, MOPSO identifies hyperparameter configurations that enhance both predictive accuracy and model robustness, enabling FATE-Net to achieve superior performance without manual intervention.
3.3. Integrated Optimization-Enhanced Forecasting Pipeline
To fully exploit the complementary strengths of multi-scale temporal modeling and evolutionary hyperparameter optimization, FATE-Net is implemented as an integrated end-to-end forecasting pipeline, as shown in Figure 1.
The process begins with a feature extraction layer that filters available price data to isolate relevant predictors. This is followed by Stage I: LSTM, which serves as a local temporal encoder to capture sequential dependencies and non-linear fluctuations through its internal gating mechanisms involving sigmoid and tanh activation functions. The locally coherent features are then fed into Stage II, a global refinement block based on the Transformer architecture. Within this block, a multi-head attention mechanism calculates contextual relevance across the entire sequence, supported by Add & Norm (residual connections) to ensure stable gradient flow. A feed-forward network further processes these global interactions, performing additional non-linear projections. Finally, the refined state is passed through a linear layer and a softmax operator to generate the final forecast prices. This hierarchical design ensures that both short-term volatility and long-range structural patterns are systematically integrated into the prediction through multiple stages of non-linear mapping.
Hybrid deep learning models often suffer from unstable training dynamics when architectural design and hyperparameter tuning are treated as separate processes. Such decoupling can lead to mismatched component scales, suboptimal learning rates, and inconsistent convergence behavior. In contrast, large language models are trained through tightly coupled pipelines in which architecture, optimization strategy, and data interact cohesively throughout the training process. FATE-Net adopts a similar philosophy by embedding MOPSO directly into the model development workflow, ensuring that the final architecture is not only theoretically well-structured but also empirically optimized for the target financial task.
Within this unified pipeline, the temporal modeling components, comprising LSTM-based local evolution and attention-based global refinement, are evaluated under a wide range of hyperparameter configurations generated by the MOPSO search process. Each configuration is assessed using multi-objective criteria, allowing the optimizer to balance relative and absolute prediction accuracy while maintaining robustness across volatile market conditions. The integration of evolutionary search with model training ensures that architectural choices, such as hidden dimensions, attention depth, and regularization strength, are aligned with the empirical behavior of the forecasting task rather than predetermined manually.
Algorithm 1 summarizes the optimization procedure. The algorithm maintains a population of candidate configurations, iteratively evaluates their performance, and updates them based on both individual experience and collective guidance from the swarm. A Pareto-based archive preserves non-dominated solutions, while crowding distance encourages diversity along the trade-off frontier. Through this iterative process, the pipeline converges toward a set of hyperparameters that jointly minimize MAPE and RMSE, yielding a model that is both accurate and stable.
Upon reaching the termination criteria in Figure 1, the final optimal solution $\mathbf{p}^{*}$ is extracted from the Pareto archive. While all solutions in the archive are mathematically equivalent in terms of non-dominance, we select the particle with the maximum crowding distance as the final configuration. This choice is motivated by the need for maximum generalization; a solution in a sparser region is less likely to be an artifact of over-fitting to specific market noise.
| Algorithm 1 Multi-Objective Particle Swarm Optimization (MOPSO). |
1: Input: objective functions f1 = MAPE, f2 = RMSE
2: Output: Pareto-optimal hyperparameter vector x*
3: Initialize particle positions x_i and velocities v_i
4: Initialize personal bests: p_i ← x_i
5: Initialize external archive A with all non-dominated x_i
6: while t < Max_Iterations and Pareto archive is not stable do
7:     for each particle i do
8:         Evaluate multi-objective fitness: F(x_i) = (f1(x_i), f2(x_i))
9:         Update personal best p_i using Pareto dominance
10:    end for
11:    Update archive A with all non-dominated particles
12:    Compute crowding distance for each a ∈ A
13:    Select global leader g from A by roulette wheel weighted by crowding distance
14:    for each particle i do
15:        v_i ← w·v_i + c1·r1·(p_i − x_i) + c2·r2·(g − x_i)
16:        x_i ← x_i + v_i
17:    end for
18: end while
19: Select final solution x* from A based on crowding distance or decision preference
20: return x*
|
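To make the loop in Algorithm 1 concrete, the following is a compact, self-contained Python sketch. The fitness callable, bounds, swarm size, and inertia/acceleration coefficients (w, c1, c2) are illustrative placeholders rather than the paper's settings; in FATE-Net, fitness(x) would decode x into LSTM–Transformer hyperparameters, train the model, and return the validation (MAPE, RMSE) pair.

```python
import random

def dominates(f1, f2):
    # f1 Pareto-dominates f2: no worse in every objective, strictly better in one.
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def crowding(fits):
    # Crowding distance of each solution along the objective-space front.
    n, m = len(fits), len(fits[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: fits[i][k])
        dist[order[0]] = dist[order[-1]] = float("inf")
        span = fits[order[-1]][k] - fits[order[0]][k] or 1.0
        for j in range(1, n - 1):
            dist[order[j]] += (fits[order[j + 1]][k] - fits[order[j - 1]][k]) / span
    return dist

def update_archive(archive, x, f):
    # Keep only mutually non-dominated (position, fitness) pairs.
    if any(dominates(g, f) for _, g in archive):
        return archive
    archive = [(p, g) for p, g in archive if not dominates(f, g)]
    archive.append((list(x), f))
    return archive

def mopso(fitness, bounds, n_particles=20, iters=40, w=0.5, c1=1.5, c2=1.5):
    dim = len(bounds)
    X = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    P = [x[:] for x in X]                      # personal bests
    Pf = [fitness(x) for x in X]
    archive = []
    for x, f in zip(X, Pf):
        archive = update_archive(archive, x, f)
    for _ in range(iters):
        cd = crowding([f for _, f in archive])
        # Roulette-wheel leader selection biased toward sparse archive regions.
        cap = 2.0 * max([d for d in cd if d != float("inf")] + [1.0])
        weights = [d if d != float("inf") else cap for d in cd]
        g = archive[random.choices(range(len(archive)), weights=weights)[0]][0]
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                V[i][d] = w * V[i][d] + c1 * r1 * (P[i][d] - X[i][d]) + c2 * r2 * (g[d] - X[i][d])
                X[i][d] = min(max(X[i][d] + V[i][d], bounds[d][0]), bounds[d][1])
            f = fitness(X[i])
            if dominates(f, Pf[i]):
                P[i], Pf[i] = X[i][:], f
            archive = update_archive(archive, X[i], f)
    return archive
```

Run on a toy bi-objective problem, the returned archive approximates the Pareto front; the final configuration would then be chosen from it by crowding distance, as described above.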
This integrated pipeline ensures that FATE-Net achieves high predictive accuracy, strong robustness under volatile market conditions, and adaptability to the complex temporal structures inherent in financial time series.
4. Experiments
4.1. Experimental Environment
The experimental evaluation of FATE-Net was conducted within a controlled and reproducible computational environment to ensure fairness and consistency across all comparative models. All implementations were developed using the PyTorch 2.3.1 deep learning framework, which provides efficient tensor operations and GPU-accelerated training suitable for large-scale temporal modeling.
To support the computational demands of multi-scale sequence modeling and multi-objective evolutionary optimization, all experiments were executed on a workstation equipped with an NVIDIA GeForce RTX 4070 GPU (16 GB VRAM) and an AMD Ryzen 7 7745HX CPU. The system operated under Windows 11, providing a stable environment for both GPU-intensive training and iterative MOPSO-based hyperparameter search.
Data preprocessing, feature normalization, and exploratory analysis were performed using a suite of Python (version 3.8) scientific libraries, including NumPy (version 1.24.3), Pandas (version 2.0.3), and scikit-learn (version 1.3.2). Visualization of temporal patterns, prediction curves, and Pareto fronts was carried out using Matplotlib (version 3.7.2). The MOPSO optimization process was implemented using the pyswarm (version 0.6) library, while all development and debugging were conducted within the VSCode environment. This configuration ensures that the evaluation of FATE-Net reflects realistic computational constraints and provides a reliable basis for comparing its performance against baseline models.
4.2. Data Preprocessing
The experiments in this study are conducted using the historical stock price dataset of BYD Co., Ltd., sourced from a publicly available Kaggle repository (
https://www.kaggle.com/datasets/eumenesxy/byd-price-until-20231122 (accessed on 6 January 2026)). As a leading enterprise in China’s electric vehicle and autonomous driving industry [
17], BYD exhibits strong market activity and pronounced price volatility, making its stock an appropriate benchmark for evaluating the forecasting capability of FATE-Net.
The dataset contains daily trading records from January 2015 to November 2023, including opening, closing, high, and low prices; previous closing price; volume-weighted average price; yield; trading volume; turnover; and trading status.
Figure 2 illustrates the closing price trajectory, which clearly reflects multi-scale temporal patterns such as long-term upward trends, short-term fluctuations, and abrupt regime shifts. These characteristics align with the motivation behind the multi-scale temporal modeling design of FATE-Net.
To ensure chronological integrity and avoid information leakage, the dataset is partitioned into training, validation, and testing subsets using a forward-in-time split. The training set covers January 2015 to April 2020, the validation set spans April 2020 to February 2022, and the testing set includes February 2022 to November 2023, as summarized in
Table 1. This partitioning strategy ensures that model evaluation reflects realistic forecasting conditions.
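A forward-in-time split of this kind takes only a few lines of Pandas. The frame below is a stand-in for the BYD dataset (column names assumed), and the cut dates are illustrative approximations of the month boundaries in Table 1, since the exact boundary days are not specified:

```python
import pandas as pd

# Hypothetical frame standing in for the BYD dataset (column names assumed).
df = pd.DataFrame({
    "date": pd.date_range("2015-01-05", periods=2200, freq="B"),  # business days
    "close": range(2200),
}).sort_values("date")  # enforce chronological order before splitting

# Forward-in-time split: no shuffling, so every evaluation sample lies
# strictly after all samples used for fitting (no information leakage).
train = df[df["date"] < "2020-05-01"]
val = df[(df["date"] >= "2020-05-01") & (df["date"] < "2022-03-01")]
test = df[df["date"] >= "2022-03-01"]
```

Because the three masks partition the timeline, every record lands in exactly one split, and the maximum training date precedes the minimum validation date by construction.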
Stock price features vary significantly in magnitude and exhibit irregular fluctuations. Feeding raw values directly into deep models may lead to unstable gradients, slow convergence, or biased optimization during the MOPSO search process. To ensure numerical stability and consistent feature scaling, min–max normalization is applied to each feature dimension:
x′ = (x − x_min) / (x_max − x_min),
where x_min and x_max denote the minimum and maximum of the corresponding feature. This transformation maps all features to the range [0, 1], facilitating stable training for both LSTM and Transformer components.
Missing values in the dataset are handled using linear interpolation based on adjacent valid observations:
x_t = x_a + ((t − a) / (b − a)) · (x_b − x_a),   a < t < b,
where x_a and x_b are the nearest valid observations before and after the missing time step t. This preserves temporal continuity and avoids introducing artificial discontinuities. The preprocessing pipeline thus yields a dataset that is clean, normalized, and temporally coherent, providing a reliable foundation for training FATE-Net and evaluating its forecasting performance.
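The two preprocessing steps just described can be sketched in NumPy as follows. Function names are illustrative; interpolation is applied before scaling so that the min/max statistics are computed on gap-free columns:

```python
import numpy as np

def interpolate_missing(x):
    # Fill NaNs column-wise by linear interpolation between adjacent valid points.
    x = x.copy()
    for j in range(x.shape[1]):
        col = x[:, j]
        mask = np.isnan(col)
        if mask.any():
            idx = np.arange(len(col))
            col[mask] = np.interp(idx[mask], idx[~mask], col[~mask])
    return x

def minmax_scale(x, eps=1e-12):
    # Per-feature min-max normalization to [0, 1]; eps guards constant columns.
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + eps)

raw = np.array([[1.0, 10.0], [np.nan, 12.0], [3.0, np.nan], [4.0, 16.0]])
clean = interpolate_missing(raw)   # NaNs become 2.0 and 14.0 (midpoints of neighbors)
scaled = minmax_scale(clean)       # every column now lies in [0, 1]
```

In a deployed pipeline the min/max statistics would typically be fitted on the training split only and reused on validation and test data, to keep the split leakage-free.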
4.3. Optimizer Selection and Evaluation Metrics
The choice of optimizer has a direct impact on the convergence behavior, stability, and final predictive performance of deep learning models. Given that FATE-Net integrates both recurrent and attention-based components, the optimization landscape is highly non-convex and sensitive to gradient dynamics. To ensure reliable training, several commonly used optimizers provided by PyTorch, such as Adam, SGD, RMSprop, Adagrad, Adamax, Nadam, Adadelta, and Rprop, are systematically evaluated.
To quantify the predictive performance of each optimizer, four widely adopted regression metrics are employed: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination (R²). These metrics jointly measure absolute deviation, squared deviation, relative deviation, and goodness-of-fit, providing a comprehensive assessment of forecasting accuracy. They are defined as follows:
MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
RMSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²)
MAPE = (100%/n) Σᵢ |(yᵢ − ŷᵢ) / yᵢ|
R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²
where yᵢ, ŷᵢ, and ȳ denote the actual value, the predicted value, and the mean of the actual values, respectively.
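All four metrics reduce to a few NumPy lines. The helper name below is illustrative, and the MAPE term assumes no zero-valued targets (true of positive stock prices):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    # Returns (MAE, RMSE, MAPE in %, R^2) for a pair of 1-D arrays.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = 100.0 * np.abs(err / y_true).mean()   # undefined if y_true contains zeros
    r2 = 1.0 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    return mae, rmse, mape, r2
```

A perfect forecast yields MAE = RMSE = MAPE = 0 and R² = 1, which is a convenient sanity check when wiring the metrics into a training loop.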
Figure 3 illustrates the training error trajectories of different optimizers, while
Table 2 summarizes their quantitative performance. Among all candidates, Adam consistently achieves the lowest MAE, RMSE, and MAPE, while also attaining the highest R² value. This superior performance can be attributed to Adam's adaptive learning rate mechanism and momentum-based updates, which are well suited to the heterogeneous gradient patterns arising from the hybrid LSTM–Transformer architecture.
Consequently, Adam is selected as the default optimizer for training FATE-Net in all subsequent experiments.
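Adam's adaptive learning rate behavior can be seen in isolation with a hand-rolled update step. This is a NumPy illustration of the standard update rule, not the paper's training code:

```python
import numpy as np

def adam_step(theta, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update: first moment m (momentum) and second moment v
    # (per-parameter adaptive scaling), both bias-corrected.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradients differing by 100x across parameters, mimicking the heterogeneous
# gradient scales of recurrent vs. attention sub-networks.
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
theta = adam_step(np.array([1.0, 1.0]), np.array([100.0, 1.0]), state)
# theta ≈ [0.9, 0.9]: the first step has magnitude ~lr in BOTH coordinates.
# Per-parameter scaling makes the update nearly insensitive to raw gradient
# magnitude, whereas a plain SGD step (lr * grad) would differ by 100x here.
```

This scale-insensitivity is one concrete reason adaptive optimizers tend to train hybrid architectures more stably than fixed-step methods.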
4.4. Optimization Results of MOPSO for LSTM–Transformer
To establish a reference point for subsequent optimization, the LSTM–Transformer model is first trained using manually selected hyperparameters commonly adopted in prior studies. The model is trained on the BYD dataset and evaluated on the chronologically separated test set.
Table 3 summarizes the baseline quantitative performance, and
Figure 4 visualizes the corresponding prediction trajectory.
The baseline model captures the general upward trend of BYD’s stock price; however, several regions exhibit noticeable deviations. In particular, around time step 100, the model persistently underestimates the closing price, suggesting that the manually chosen hyperparameters do not fully leverage the representational capacity of the hybrid architecture. This motivates the use of an automated search strategy capable of exploring a broader and more complex hyperparameter space.
To improve predictive performance, MOPSO is applied to optimize the hyperparameters of the LSTM–Transformer model. During the iterative search, each particle encodes a candidate configuration of LSTM and Transformer parameters, and its fitness is evaluated using the multi-objective criteria defined earlier. After convergence, the optimal hyperparameter configuration is obtained, as shown in
Table 4.
Using this optimized configuration, the model is retrained to obtain the final FATE-Net predictor. The improved performance is reported in
Table 5, and the corresponding prediction curve is shown in
Figure 5.
Compared with the baseline model, MAE and RMSE decrease by 2.062 and 2.904, respectively, MAPE falls by 1.24 percentage points, and the coefficient of determination (R²) increases from 0.978 to 0.997. These improvements demonstrate that MOPSO effectively identifies a more suitable hyperparameter configuration, enabling the hybrid architecture to better capture both local temporal patterns and long-range dependencies.
Figure 6 provides a direct comparison of predictions before and after optimization, clearly showing the enhanced alignment between predicted and actual closing prices.
4.5. Performance Evaluation Against Other Models
To further assess the effectiveness of the proposed FATE-Net framework, its predictive performance is compared with several widely used baseline models, including Support Vector Regression (SVR), Gradient Boosting Machine (GBM), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Transformer, and the unoptimized LSTM–Transformer hybrid. These models represent classical machine learning approaches, recurrent architectures, attention-based architectures, and hybrid deep learning baselines, providing a comprehensive benchmark for evaluating the advantages of the proposed method.
Table 6 summarizes the quantitative performance of all models across four evaluation metrics (MAE, RMSE, MAPE, and
), while
Figure 7 presents the corresponding prediction curves.
Figure 8 provides an aggregated comparison, highlighting the relative performance differences among all models. The results clearly show that FATE-Net achieves the closest alignment with actual stock prices, significantly outperforming all baseline approaches across all evaluation metrics. This improvement demonstrates the effectiveness of integrating multi-scale temporal modeling with multi-objective evolutionary optimization.
4.6. Model Performance Consistency and Generalization Ability Analysis
Beyond the superior out-of-sample performance relative to baseline models, we further evaluate the generalization ability and overfitting risk of FATE-Net by comparing its predictive performance across the chronologically split training, validation, and testing sets, as presented in
Table 7.
FATE-Net reports MAE values of 0.928, 1.012, and 1.051 on the training, validation, and testing sets, respectively. The RMSE values increase from 1.217 to 1.364 and then to 1.435, while the MAPE values rise slightly from 0.32% to 0.35% and 0.37%. At the same time, the R² values remain at 0.998, 0.997, and 0.997 across the three datasets. These results show that prediction accuracy decreases slightly as the data move from seen to unseen samples, yet the overall performance level remains stable.
The MAE increases by about 13.3% from training to testing, and the difference between validation and testing is about 3.9%. RMSE and MAPE follow the same pattern. This modest, steady growth suggests that the model learns general temporal patterns that continue to hold when applied to new data.
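These relative gaps follow directly from the Table 7 MAE values:

```python
# MAE on the chronological training / validation / testing splits (Table 7).
mae_train, mae_val, mae_test = 0.928, 1.012, 1.051

train_to_test = (mae_test - mae_train) / mae_train * 100  # ≈ 13.3%
val_to_test = (mae_test - mae_val) / mae_val * 100        # ≈ 3.9%
```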
The comparison between the validation and testing sets provides further insight. The metric values on these two datasets are very close, and no sharp decline appears in any metric. This consistency indicates that the hyperparameter selection process does not over-adapt to the validation data: the model retains a similar level of accuracy on future samples that played no part in model tuning.
The R² values support this observation. All three splits show values close to 1, and the differences between them are minimal. This pattern means that the model maintains strong explanatory ability across different data splits without large fluctuations.
Taken together, the small gap among training, validation, and testing results reflects stable behavior. The results therefore suggest that FATE-Net maintains good generalization ability and does not show clear signs of overfitting under the current experimental setting.
4.7. Discussion
The experimental findings demonstrate that FATE-Net achieves substantial improvements over both classical machine learning models and contemporary deep learning architectures in short-term stock price forecasting. The performance gains primarily stem from the model’s ability to integrate multi-scale temporal representations with a principled hyperparameter optimization strategy.
The hybrid LSTM–Transformer architecture plays a central role in this improvement. The LSTM component effectively models short- and mid-term sequential dynamics, capturing local temporal continuity and volatility patterns. In contrast, the Transformer’s self-attention mechanism provides a complementary capability by modeling long-range dependencies and global contextual interactions that recurrent structures often fail to capture. This dual-path temporal representation enables FATE-Net to better characterize the nonlinear, non-stationary, and regime-sensitive behavior inherent in financial time series.
The incorporation of MOPSO further enhances predictive accuracy by enabling a data-driven search for optimal hyperparameters. Manual tuning of hybrid architectures is typically subjective and prone to suboptimal configurations, especially given the complex interactions among LSTM hidden sizes, attention heads, embedding dimensions, expansion ratios, dropout rates, and learning rates. MOPSO addresses this challenge through a multi-objective evolutionary search that jointly minimizes prediction error and improves model stability. The resulting configuration yields a more expressive and better-calibrated model, as reflected in the significant reductions in MAE, RMSE, and MAPE, and the near-perfect R² score achieved on the test set.
Beyond statistical superiority, these quantitative improvements carry significant economic implications for real-world investment strategies. In quantitative finance, prediction errors directly translate into execution costs and financial risk. For instance, the substantial reduction in MAPE from 1.61% (baseline) to 0.37% (FATE-Net) is highly relevant for algorithmic trading. In markets where profit margins per trade are often fractional, a 1.24-percentage-point improvement in relative accuracy can significantly reduce execution slippage and optimize entry/exit timing. Furthermore, the reduction in RMSE provides crucial value for risk management. Because RMSE heavily penalizes large absolute forecasting errors, FATE-Net's lower RMSE indicates a reduced propensity for generating extreme false signals, such as predicting a major price breakout that fails to materialize. By avoiding these catastrophic wrong-way bets, the model inherently minimizes tail risk, thereby offering the potential to improve risk-adjusted returns (e.g., Sharpe ratio) when integrated into a broader automated trading or portfolio optimization framework.
Despite these advantages, several limitations remain. The hybrid architecture introduces non-negligible computational overhead due to the Transformer component, and the iterative nature of MOPSO increases training time. These factors may restrict deployment in latency-sensitive or resource-constrained environments. Moreover, the current framework relies exclusively on historical price-based features. External information, such as macroeconomic indicators, market sentiment, and news events, plays a critical role in financial markets but is not incorporated in the present model. Integrating multimodal data sources may further enhance robustness, interpretability, and generalization across different market conditions.
Future work will explore reducing model complexity through lightweight attention mechanisms or neural architecture search (NAS), enabling real-time deployment in practical trading systems. Additionally, extending FATE-Net to incorporate heterogeneous data modalities, such as textual sentiment, cross-market signals, or macroeconomic indicators, may improve predictive performance and resilience under varying market regimes. Finally, evaluating the model under stress scenarios and across diverse financial instruments will provide deeper insights into its stability and practical applicability.
5. Conclusions
This study introduces FATE-Net, an integrated forecasting framework specifically designed to tackle the nonlinear, volatile, and multi-scale nature of financial time series. By unifying LSTM-based local temporal modeling with attention-driven global contextual refinement, the framework successfully captures both short-term sequential dynamics and long-range dependencies.
To overcome the challenge of manual hyperparameter tuning, FATE-Net incorporates MOPSO, creating a fully data-driven pipeline that optimizes architectural configurations for both relative and absolute predictive accuracy. The experimental evaluation using real-world BYD stock data yields concrete key findings that demonstrate the framework's superior predictive capability. Most notably, the optimized framework achieved exceptional accuracy metrics, including an MAE of 1.051, an RMSE of 1.435, a MAPE of 0.37%, and an R² of 0.997. Furthermore, the empirical results explicitly validate the integration of MOPSO; compared to the unoptimized LSTM–Transformer baseline, the MOPSO-enhanced model reduced MAE by 2.062, RMSE by 2.904, and MAPE by 1.24 percentage points while boosting the R² score from 0.978 to 0.997. These quantitative findings clearly establish that coupling multi-scale temporal modeling with principled evolutionary optimization significantly outperforms classical machine learning models and single-family deep learning architectures. Ultimately, these findings confirm that systematically navigating the high-dimensional hyperparameter space with MOPSO eliminates reliance on manual trial and error, resulting in a model with enhanced expressiveness, stability, and generalization across varying market conditions.
While FATE-Net provides a robust foundation for univariate stock price prediction, financial markets are heavily influenced by exogenous factors. Consequently, future research will focus on integrating multimodal data sources—such as macroeconomic indicators, market sentiment, and cross-market signals—into the temporal modeling pipeline to further elevate the framework’s robustness and interpretability in complex trading environments.