Article

Capturing Short- and Long-Term Temporal Dependencies Using Bahdanau-Enhanced Fused Attention Model for Financial Data—An Explainable AI Approach

by
Rasmi Ranjan Khansama
1,
Rojalina Priyadarshini
1,*,
Surendra Kumar Nanda
1 and
Rabindra Kumar Barik
2,*
1
Department of Computer Science and Engineering, C.V. Raman Global University, Bhubaneswar 752054, Odisha, India
2
School of Computer Applications, Kalinga Institute of Industrial Technology Deemed to Be University, Bhubaneswar 751024, Odisha, India
*
Authors to whom correspondence should be addressed.
Submission received: 24 October 2025 / Revised: 4 December 2025 / Accepted: 29 December 2025 / Published: 7 January 2026

Abstract

Prediction of stock closing prices plays a critical role in financial planning, risk management, and informed investment decision-making. In this study, we propose a novel model that synergistically amalgamates a Bidirectional GRU (BiGRU) with three complementary attention techniques—Top-k Sparse, Global, and Bahdanau Attention—to tackle the complex, intricate, and non-linear temporal dependencies in financial time series. The proposed Fused Attention Model is validated on two highly volatile, non-linear, and complex-patterned stock indices, NIFTY 50 and S&P 500, with 80% of the historical price data used for model learning and the remaining 20% for testing. A comprehensive analysis of the results, benchmarked against various baseline and hybrid deep learning architectures across multiple regression performance metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and R2 Score, demonstrates the superiority of our proposed Fused Attention Model. Most significantly, the proposed model yields the highest prediction accuracy and generalization capability, with R2 scores of 0.9955 on NIFTY 50 and 0.9961 on S&P 500. Additionally, to address the interpretability and transparency issues of deep learning models for financial forecasting, we utilized three different Explainable Artificial Intelligence (XAI) techniques, namely Integrated Gradients, SHapley Additive exPlanations (SHAP), and Attention Weight Analysis. The results of these three XAI techniques validated the use of the three attention techniques alongside the BiGRU model. The explainability of the proposed model, named BiGRU-based Fused Attention (BiG-FA), in addition to its superior performance, thus offers a robust and interpretable deep learning model for time-series prediction, making it applicable beyond the financial domain.

1. Introduction

Forecasting the future prices of stock indices and equities from their past sequential data is a forecasting task applied to the financial market domain. According to studies [1], there is great interest among researchers, investors, and traders in building a forecasting model capable of achieving high accuracy in stock price prediction, since accurate forecasts help traders and investors make buy or sell decisions, which typically result in substantial financial gains. According to the efficient-market hypothesis, better returns can be obtained by taking on higher risk rather than by relying on analysis or predictions to find the most profitable stocks. This suggests that price volatility not attributable to new data is random and fundamentally unpredictable [2]. On the other hand, those who disagree with this viewpoint contend that some data-analytical approaches, such as machine learning models and technical indicators, are capable of unearthing hidden and complex signals that can be utilized for forecasting future stock price trends. Anticipating the stock market is difficult and complex because it depends on a wide variety of dynamic elements, including macroeconomic indicators, alterations in government policy, public sentiment, and global geopolitical events [3,4,5,6].
Earlier studies in the field of time-series forecasting mostly relied on traditional statistical models intended to identify patterns in sequential data. The Autoregressive Moving Average (ARMA) framework, which integrates linear regression, Autoregressive (AR), and Moving Average (MA) components, is one of the most frequently utilized methodologies [7]. Financial time series have also been modeled with Autoregressive Conditional Heteroskedasticity (ARCH) and its generalized version, Generalized Autoregressive Conditional Heteroskedasticity (GARCH), to capture the uncertainty in stock market time series [8,9]. Researchers later turned to machine learning methods to obtain more precise time-series forecasts [10], because standard models are unable to extract the non-linear and chaotic dependencies of a stock market.
Machine learning (ML) methods are very useful for modeling patterns that are neither linear nor stationary [11]. A diverse array of machine learning models has been utilized to capture stock patterns and volatility, including Support Vector Regression (SVR) [12] and Artificial Neural Networks (ANNs) [13]. These models leverage past financial time-series sequences and technical indicators to find relationships over time [14]. The predictions of these machine learning models have been shown to be superior to those of classic statistical models like Autoregressive Integrated Moving Average (ARIMA) and GARCH, particularly when it comes to catching short-term trends. Ensemble techniques, including bagging, boosting, and stacking, have further improved prediction performance by lowering variance and bias [15]. Despite their robust performance, the most common problem with these models is memorizing noise and random fluctuations: the model tends to memorize training patterns but fails to generalize to test data. In addition, machine learning models have difficulty modeling long-range temporal relationships, which are critical for financial time-series predictions [16]. Furthermore, unpredictable stock price fluctuations hinder the performance of ML models. Because of these limitations, past research on predicting financial time series has turned to deep learning architectures, which can better capture short- and long-term temporal patterns.
To better deal with the non-linear and volatile character of financial time series, modern studies on stock market prediction have employed deep learning (DL) methods such as Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs). Models like Bidirectional LSTM (BiLSTM) and Bidirectional GRU (BiGRU), the bidirectional versions of LSTM and GRU networks, have proven very good at handling long-term dependencies in temporal data [17,18,19,20,21]. Hybrid deep learning frameworks have also been suggested to help find patterns in stock data that are noisy and highly fluctuating. The Convolutional Neural Network (CNN)-GRU is one example of this kind of framework; it extracts spatial characteristics and models temporal sequences at the same time. The performance of these models has led to leveraging data-driven and adaptable methods in place of classical neural network models for understanding the complex behavior of financial markets [22].
Attention mechanisms [23] have proven to be a powerful technique for both classification and regression modeling, functioning by focusing on the most important time steps in long sequences within deep learning models. Attention mechanisms were first devised for neural machine translation but have subsequently been adopted in many distinct fields, including speech analysis, image recognition, and time-series forecasting. These mechanisms make predictions more accurate and models more robust by letting the model dynamically select the most impactful time steps in the input financial time series. This mechanism also helps lessen the effects of noise and swings in financial time-series sequences by giving more weight to certain time steps.
DL algorithms with attention mechanisms have made considerable progress in modeling the future price of financial time series, but a number of important shortcomings remain to be mitigated. Many studies on stock market forecasting use only one kind of attention mechanism and fail to acknowledge the significance of combining an array of attention methods to capture multi-scale temporal dependencies. There is also insufficient emphasis on developing interpretable models, which is essential when making important financial decisions. In addition, although Bahdanau Attention has been shown to be useful in fields such as energy forecasting and Natural Language Processing (NLP), it has not yet been employed to a significant extent in the context of stock market forecasting. These research gaps in the literature have prompted us to develop a robust, trustworthy, and interpretable framework that makes use of the advantages offered by complementary attention mechanisms.
The main contributions of our study are as follows:
  • In this study, we propose a novel deep learning architecture combining BiGRU with three different but complementary attention mechanisms—Top-k Sparse, Global, and Bahdanau Attention—to capture multi-scale temporal dependencies in financial time series. This fusion addresses the limitations of earlier studies that utilized either single attention or simple recurrent architecture to capture both short- and long-term temporal dependencies.
  • Secondly, in this study, we uniquely leverage Bahdanau Attention within a BiGRU-based architecture to model the alignment between temporal features and target outputs, enabling dynamic focus on the most relevant time steps. This integration contributes to the advancement of interpretable and accurate forecasting models in the financial domain.
  • Thirdly, our proposed model is evaluated on two highly volatile, non-linear, and complex-patterned stock indices, NIFTY 50 and S&P 500, and outperforms all baseline and hybrid deep learning models across multiple performance metrics (MAE, RMSE, MAPE, R2). The experimental results on NIFTY 50 and S&P 500 demonstrate the strong generalizability and robustness of our proposed model.
  • Finally, many studies on stock price forecasting have overlooked the interpretability of black-box deep learning models used to predict future stock prices, which is essential for financial decision-making. To address this gap, three interpretability techniques—Attention Weight Visualization, Integrated Gradients, and SHAP—are applied here to provide insight into and transparency of the model's predictions. The results of the interpretability techniques on both stock indices confirm that the proposed BiGRU-based Fused Attention Model learns to focus on the most recent and relevant time steps, enhancing trust and transparency in financial predictions. The utilization of various Explainable AI (XAI) techniques along with the proposed model for financial forecasting will be beneficial to academicians doing research in this field and can serve as a reference for building a deep learning model suitable for real-time forecasting.

2. Literature Review

Stock market modeling remains a daunting task due to the highly volatile and non-linear nature of the modern financial environment. Existing studies on stock market prediction can be segmented into traditional statistical approaches, machine learning approaches, deep learning approaches, and more advanced attention-based deep learning approaches. Traditional methods are based on statistical techniques that model the inherent temporal dependencies of time-series sequences, namely ARMA and ARIMA [7,24]. These methods use autoregressive terms to learn the temporal dependencies in financial time series and to mitigate the presence of random noise. However, they rest on the assumption that the data is stationary and linear, which is often not the case with stock data because of its high randomness and non-linearity. Traditional methods also require the conversion of non-stationary time-series sequences into stationary ones to effectively capture the patterns in stock time-series data. Therefore, owing to the uncertain, chaotic, and heavy-tailed nature of the modern stock market, these methods struggle to predict stock prices well.
ML techniques possess robust generalization power, which enables them to effectively capture temporal dependencies in unseen stock market data. Among these, Support Vector Machines (SVMs) have been widely leveraged in past studies for stock market prediction due to their inherent ability to capture non-linearity and to handle high-dimensional data [25]. Owing to their efficacy, these models are often hybridized with other machine learning approaches for stock price prediction, offering more promising predictive accuracy. The author of [26] proposed a cumulative model combining Least Squares Support Vector Machine and Autoregressive Moving Average techniques for stock market forecasting. The results of the study showed that the combined model outperformed the individual SVM model.
ANNs have attracted researchers and academicians for stock market prediction due to their capability to adapt to new data and to model intricate non-linear patterns in stock market data. The author of [27] performed a systematic review on the efficacy of ML techniques for stock market forecasting and concluded that ANNs are the most widely utilized models for this task. Researchers have also utilized numerous ensemble models to develop high-performance stock market prediction models. The author of [28] evaluated the efficacy of different ensemble methods for stock market forecasting. Despite their success, these models face various challenges, such as the choice of base classifiers and the concern of selecting the right combination method among many, such as stacking, boosting, bagging, and blending. The study [29] presented a hybrid framework utilizing technical indicator-derived trading rules with a genetic algorithm to maximize the Sharpe and Sterling Ratio (GA-MSSR). This work demonstrates competitive results and broadens the scope of financial prediction tasks.
With the recent advancements in neural network technologies, deep learning models like LSTM and GRU have become prominent in modeling different ML tasks, specifically due to their capability to model long-term complex dependencies in time-series sequences, as highlighted in [17,30]. The study [31] presents a unique neural network leveraging stock vectors and an LSTM with an automatic encoder to predict stock prices. The study revealed that converting the time series into stock vectors provides more contextual information for the model and yields better results on the Shanghai A-Shares Index. The study [32] proposed a noise-resilient temporal modeling approach, Robust Dual Recurrent Neural Networks (RDRNNs), to extract patterns from volatile financial time series. Empirical evaluations on Chinese stock indices show that the proposed work outperforms standard RNN baselines. Various studies [33,34,35,36,37], among others, highlighted the efficacy of the LSTM model for long-range stock market time-series forecasting as compared to conventional methods. The success of the LSTM model relies on modeling intricate patterns and capturing non-linear temporal relationships in stock data, which conventional methods struggle to achieve. Owing to its gating mechanism, the LSTM model learns both short-term and long-term patterns effectively and therefore offers a robust and reliable predictive model in the modern financial environment. A recent comprehensive review presented in [38] evaluated a broad category of deep learning techniques, such as CNN, LSTM, DNN, RNN, reinforcement learning, Hybrid Attention Networks, Wavenet, and NLP-driven self-paced learning, for stock and Forex prediction tasks.
Bidirectional variants such as BiLSTM and BiGRU further enhanced modeling capabilities by processing input sequences bidirectionally, that is, modeling temporal relationships using both past and future time steps. Hybrid models combining CNNs and LSTM-based layers have also been explored for capturing spatial and temporal features. The author of [39] proposed a combined CNN–BiLSTM model for the prediction of the Shanghai Composite Index. The study first leveraged a CNN for feature extraction from the time-series sequence and then utilized the BiLSTM model for prediction, concluding that the proposed model yields a higher R2 compared to seven other baseline models. The author of [40] proposed a BiGRU model with hyperparameter optimization using the Whale algorithm for stock market forecasting. The study revealed that the proposed model greatly improved prediction performance.
The introduction of attention mechanisms by [41] further revolutionized time-series forecasting. Attention mechanisms adjust feature weights dynamically based on their importance. Among the many categories of attention, temporal attention has garnered significant interest among researchers for modeling time-series data, as it allows the model to dynamically focus on the most critical time steps of the input sequence. Attention techniques, which began with applications in computer vision [42], were subsequently investigated in various other domains, notably in financial time-series forecasting [43,44]. This capability of focusing on the most critical time steps, which in turn boosts prediction accuracy, makes temporal attention particularly valuable in dynamic financial markets.
RNNs with attention mechanisms were explored by researchers to boost predictive power owing to their ability to focus on the most influential time steps [45]. The author of [46] implemented an attention-based LSTM model to predict the stock prices of the S&P 500 and DJIA stock indices. The results revealed that the attention-based LSTM model outperformed the two baseline models, LSTM and GRU, in terms of Mean Absolute Error. The author of [47] employed attention mechanisms in LSTM and GRU for the prediction of stock price trends. The experimental results show that the proposed hybrid outperformed the individual LSTM and GRU models owing to the introduction of the attention technique. The author of [48] proposed an integration of the attention mechanism with a CNN-BiLSTM model to improve predictions through its ability to dynamically weight important features. The findings show that the proposed hybrid achieves the highest accuracy compared to LSTM and CNN-LSTM across eight international stock indices. The author of [49] incorporated the Bahdanau Attention mechanism into the GRU model to accurately forecast short-term load. The experimental findings revealed that the combined model enhanced accuracy compared to several baseline models, as Bahdanau Attention enables alignment between input time series and targets by dynamically weighting important features. The author of [50] proposed an attention-based deep BiLSTM model optimized with the COOT Birds algorithm for modeling stock market time series. The study revealed that the proposed optimized model yielded better results in terms of several regression error metrics on the NIFTY 50 stock index.
The Transformer, initially designed for natural language tasks, has been applied to stock market prediction due to its core component, the multi-head attention mechanism, which enables it to process time-series sequences in parallel and capture the complex dynamics of the stock market. The author of [51] implemented the transformer model for the prediction of several stock indices. Their findings revealed that the transformer model outperformed several classical deep learning models on all indices. Additionally, transformer models have been combined with baseline models like LSTM, GRU, BiLSTM, and BiGRU to further enhance the predictions and stability of the model. The author of [52] proposed a combination of BiLSTM, to process the sequence in both directions, and a transformer model, to capture long-term dependencies in time-series sequences. They tested the performance of the proposed model on five stock indices, and the empirical analysis shows that the proposed hybrid significantly outperformed other existing methods in several regression error metrics. The study [53] utilized market sentiments along with numerical time series as input to the proposed FinALBERT language model for the prediction of stock movements. Experimental evaluations on Stocktwits data across 25 major companies, including FAANG stocks, show a substantial improvement in the predictive power of the model.
The continued evolution and success of various attention mechanisms in building more sophisticated forecasting models make them a cornerstone of the financial time-series forecasting domain. However, many existing works incorporate only a single type of attention mechanism, limiting the model's flexibility in capturing various types of temporal patterns. Moreover, the application of Bahdanau Attention to stock market forecasting has not yet been explored. Furthermore, prior works on attention-based stock forecasting were designed around single-scale temporal modeling, that is, capturing either local or global dependencies. Models focused on capturing local dependencies are good for short-term volatility but struggle to incorporate distant historical information. Similarly, models based on capturing global dependencies struggle to capture sharp local moves. Thus, these frameworks fail to capture the multi-scale patterns inherent in financial time series. Another important limitation of earlier studies is model interpretability. Even though earlier studies utilized attention-based models, their interpretability is still limited to Attention Weight Visualization, which only partially explains the model's predictions. They seldom used richer XAI approaches like SHAP and Integrated Gradients to relate financial events or features in the financial time series. Thus, earlier studies still treated the models as partial black boxes.
Motivated by these unresolved issues, the proposed BiG-FA architecture integrates three complementary attention mechanisms—Sparse Attention for short-term movements, Global Attention for long-range dependencies, and Bahdanau Attention for fine-grained temporal alignment. This fusion directly addresses the limitation of prior works that struggle to model both short and long temporal scales simultaneously. Furthermore, we used multiple explainability tools, including Attention Weight Visualization, SHAP, and Integrated Gradients, to explain the proposed model's decisions—an aspect often overlooked in earlier stock forecasting studies.

3. Proposed Model Description

The proposed model, named BiGRU + Fused Attention (BiG-FA), integrates the temporal learning strengths of Bidirectional Gated Recurrent Units (BiGRUs) with a fusion of three different attention mechanisms: Sparse Attention, Global Attention, and Bahdanau Attention. This section provides details of the model architecture that anticipates the closing price of stock indices and the mathematical formulation of each attention approach. The block diagram of the proposed Fused Attention Model is illustrated in Figure 1, and the procedure for stock closing price forecasting is presented in Algorithm 1.
Algorithm 1 BiG-FA Model for Stock Closing Price Forecasting
  • Input:
  •    Univariate time series X = [x_1, x_2, \ldots, x_T]
  •    Sequence length L
  •    Forecast horizon: 1-step ahead
  • Output: Predicted value \hat{y}_{t+1}
1: Normalize X using MinMaxScaler.
2: Create sequences X_i = [x_i, x_{i+1}, \ldots, x_{i+L-1}], target y_i = x_{i+L}.
3: Input shape: (L, 1)
4: for epoch = 1 to 50 do
5:     H = [h_1, h_2, \ldots, h_L] \leftarrow \text{BidirectionalGRU}(X), \quad h_t \in \mathbb{R}^{2d}
6:     Compute Sparse Attention scores: e_t = W_s^{\top} h_t
7:     Mask scores not in top-k: \tilde{e}_t = e_t if t \in \text{Top-}k(e), -\infty otherwise
8:     Compute Sparse Attention Weights: \alpha_t^{(s)} = \exp(\tilde{e}_t) / \sum_{j \in \text{Top-}k} \exp(\tilde{e}_j)
9:     Compute sparse context vector: c_{\text{sparse}} = \sum_{t=1}^{L} \alpha_t^{(s)} h_t
10:   Compute Global Attention scores: e_t = W_g^{\top} h_t
11:   Compute Global Attention Weights: \alpha_t^{(g)} = \exp(e_t) / \sum_{j=1}^{L} \exp(e_j)
12:   Compute global context vector: c_{\text{global}} = \sum_{t=1}^{L} \alpha_t^{(g)} h_t
13:   Compute average query for Bahdanau: q = \frac{1}{L} \sum_{t=1}^{L} h_t
14:   Compute alignment score: s_t = \tanh(W_q q + W_h h_t + b)
15:   Compute Bahdanau Attention Weights: \alpha_t^{(b)} = \exp(v^{\top} s_t) / \sum_{j=1}^{L} \exp(v^{\top} s_j)
16:   Compute Bahdanau context vector: c_{\text{bahdanau}} = \sum_{t=1}^{L} \alpha_t^{(b)} h_t
17:   Concatenate context vectors: c_{\text{fused}} = [c_{\text{sparse}} \| c_{\text{global}} \| c_{\text{bahdanau}}]
18:   Pass c_{\text{fused}} through a Dense layer with ReLU and Dropout
19:   Final prediction: \hat{y}_{t+1} = \text{Dense}(c_{\text{fused}})
20:   Optimize using MSE loss and the Adam optimizer
21: end for

3.1. Input Representation

The input univariate time series is denoted as
X = [x_1, x_2, \ldots, x_T], \quad x_t \in \mathbb{R}
The training sequences are constructed using a sliding window of fixed length L. Each sequence–target pair is given by
X_i = [x_i, x_{i+1}, \ldots, x_{i+L-1}] \in \mathbb{R}^L, \quad y_i = x_{i+L}
The time series is normalized using MinMaxScaler to the range [0, 1].
In this study, we employ closing prices as the input representation. However, the proposed model can be extended to incorporate microstructure features such as volume, turnover, and bid–ask spreads, as well as macroeconomic variables like gross domestic product, inflation, interest rates, the purchasing managers' index, exchange rates, and money supply. Microstructure feature integration can be achieved with a cross-feature attention module that effectively models interactions between price and non-price features. Integration of macroeconomic variables requires an effective alignment mechanism to resolve mixed-frequency challenges, because these indicators are updated weekly or monthly. Techniques such as forward-fill interpolation, upsampling, and Mixed Data Sampling (MIDAS) networks can be utilized to address this challenge. However, we avoid incorporating these features in the present study in order to maintain a pure time-series forecasting framework.

3.2. BiGRU Encoder

The sequence of input observations X_i is propagated through a Bidirectional GRU layer to produce a set of sequential hidden states:
H = [h_1, h_2, \ldots, h_L], \quad h_t \in \mathbb{R}^{2d}
where d is the dimension of the GRU hidden state. The output H \in \mathbb{R}^{L \times 2d} captures temporal dependencies across both past and future time steps.
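As an illustrative sketch (not the authors' implementation; the toy weights, dimensions, and helper names such as `gru_forward` and `bigru_encode` are hypothetical), the bidirectional encoding can be expressed in plain NumPy: the sequence is processed forward and backward by two independent GRUs, and the two d-dimensional states are concatenated per time step to give h_t of size 2d.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_forward(X, P):
    """Run a single-direction GRU over X of shape (L, m) with parameter dict P."""
    L, _ = X.shape
    h = np.zeros(P["Uz"].shape[0])
    states = []
    for t in range(L):
        x = X[t]
        z = sigmoid(P["Wz"] @ x + P["Uz"] @ h + P["bz"])            # update gate
        r = sigmoid(P["Wr"] @ x + P["Ur"] @ h + P["br"])            # reset gate
        h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h) + P["bh"])  # candidate state
        h = (1 - z) * h + z * h_tilde
        states.append(h)
    return np.stack(states)                                         # (L, d)

def bigru_encode(X, P_fwd, P_bwd):
    """Concatenate forward and time-reversed backward GRU states: shape (L, 2d)."""
    Hf = gru_forward(X, P_fwd)
    Hb = gru_forward(X[::-1], P_bwd)[::-1]
    return np.concatenate([Hf, Hb], axis=1)

def random_params(m, d, rng):
    # Toy random weights for illustration only.
    return {k: rng.standard_normal((d, m)) * 0.1 for k in ("Wz", "Wr", "Wh")} \
         | {k: rng.standard_normal((d, d)) * 0.1 for k in ("Uz", "Ur", "Uh")} \
         | {k: np.zeros(d) for k in ("bz", "br", "bh")}

rng = np.random.default_rng(0)
L, m, d = 30, 1, 4
X = rng.standard_normal((L, m))
H = bigru_encode(X, random_params(m, d, rng), random_params(m, d, rng))  # (30, 8)
```

With d = 4 the encoder emits 30 hidden states of dimension 2d = 8, matching the shape the attention modules below consume.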

3.3. Sparse Attention

Sparse Attention focuses on key time steps, ignoring the rest. Unnormalized attention scores are computed as
e_t^{(s)} = W_s^{\top} h_t
Only the top-k highest scores are retained:
\tilde{e}_t = e_t^{(s)} if t \in \text{Top-}k(e), \quad -\infty otherwise
Attention Weights are computed via the softmax function:
\alpha_t^{(s)} = \frac{\exp(\tilde{e}_t)}{\sum_{j \in \text{Top-}k} \exp(\tilde{e}_j)}
The Sparse Attention context vector is
c_{\text{sparse}} = \sum_{t=1}^{L} \alpha_t^{(s)} h_t
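A minimal NumPy sketch of the Top-k mechanism (random toy inputs; `sparse_attention` is a hypothetical helper, not the paper's code). Setting masked scores to negative infinity means the exponential assigns them exactly zero weight:

```python
import numpy as np

def sparse_attention(H, w_s, k):
    """Top-k Sparse Attention over hidden states H of shape (L, 2d)."""
    e = H @ w_s                          # scores e_t = w_s^T h_t, shape (L,)
    top = np.argsort(e)[-k:]             # indices of the k largest scores
    e_masked = np.full_like(e, -np.inf)
    e_masked[top] = e[top]               # keep only the top-k scores
    a = np.exp(e_masked - e[top].max())  # exp(-inf) = 0 for masked steps
    a /= a.sum()                         # softmax over the retained scores
    return a @ H, a                      # context c_sparse (2d,), weights (L,)

rng = np.random.default_rng(1)
H = rng.standard_normal((30, 8))         # toy stand-in for BiGRU hidden states
w_s = rng.standard_normal(8)
c, a = sparse_attention(H, w_s, k=5)
```

Exactly k of the L weights are non-zero, which is what gives this mechanism its robustness to noisy time steps.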

3.4. Global Attention

Global Attention considers all time steps. Attention scores and weights are calculated as
e_t^{(g)} = W_g^{\top} h_t, \quad \alpha_t^{(g)} = \frac{\exp(e_t^{(g)})}{\sum_{j=1}^{L} \exp(e_j^{(g)})}
The resulting global context vector is
c_{\text{global}} = \sum_{t=1}^{L} \alpha_t^{(g)} h_t
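Global Attention differs from the sparse variant only in that the softmax runs over all L time steps; a hedged sketch under the same toy setup (`global_attention` is an illustrative helper):

```python
import numpy as np

def global_attention(H, w_g):
    """Global Attention: softmax-weighted sum over all hidden states."""
    e = H @ w_g              # scores e_t = w_g^T h_t, shape (L,)
    a = np.exp(e - e.max())  # numerically stable softmax over all L steps
    a /= a.sum()
    return a @ H, a          # context c_global (2d,), weights alpha^(g) (L,)

rng = np.random.default_rng(2)
H = rng.standard_normal((30, 8))
c, a = global_attention(H, rng.standard_normal(8))
```

Unlike Sparse Attention, every time step receives a strictly positive weight here, letting the model retain long-range context.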

3.5. Bahdanau Attention

A shared query vector q is formed by averaging the hidden states:
q = \frac{1}{L} \sum_{t=1}^{L} h_t
The alignment score for each time step is computed as
s_t = \tanh(W_q q + W_h h_t + b)
The Bahdanau Attention Weights are given by
\alpha_t^{(b)} = \frac{\exp(v^{\top} s_t)}{\sum_{j=1}^{L} \exp(v^{\top} s_j)}
The Bahdanau context vector is computed as
c_{\text{bahdanau}} = \sum_{t=1}^{L} \alpha_t^{(b)} h_t
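The Bahdanau equations above can be sketched as follows (toy random parameters; the attention dimension and helper name are assumptions, not the trained model's values):

```python
import numpy as np

def bahdanau_attention(H, Wq, Wh, b, v):
    """Additive (Bahdanau) attention with a mean-pooled query vector."""
    q = H.mean(axis=0)                   # shared query q = (1/L) sum_t h_t
    s = np.tanh(H @ Wh.T + Wq @ q + b)   # alignment s_t, shape (L, attn_dim)
    scores = s @ v                       # v^T s_t, shape (L,)
    a = np.exp(scores - scores.max())    # numerically stable softmax
    a /= a.sum()
    return a @ H, a                      # context c_bahdanau (2d,), weights (L,)

rng = np.random.default_rng(3)
L, two_d, attn = 30, 8, 16
H = rng.standard_normal((L, two_d))
c, a = bahdanau_attention(
    H,
    Wq=rng.standard_normal((attn, two_d)),
    Wh=rng.standard_normal((attn, two_d)),
    b=np.zeros(attn),
    v=rng.standard_normal(attn),
)
```

The learned projections W_q, W_h, and v are what let this mechanism align each hidden state against the query before scoring, rather than taking a raw dot product.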

3.6. Fusion and Output Prediction

The outputs of all three attention techniques are concatenated to form a fused context vector:
c_{\text{fused}} = [c_{\text{sparse}} \| c_{\text{global}} \| c_{\text{bahdanau}}] \in \mathbb{R}^{6d}
This fused vector is passed through a dense layer with ReLU activation and dropout:
z = \text{ReLU}(W_z c_{\text{fused}} + b_z)
The final prediction is computed as
\hat{y}_{t+1} = W_o z + b_o
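A minimal sketch of the fusion step, continuing the toy setup (weights, sizes, and the `fuse_and_predict` helper are illustrative; dropout is omitted since it is inactive at inference time):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_and_predict(c_sparse, c_global, c_bahdanau, Wz, bz, Wo, bo):
    """Concatenate the three context vectors, apply Dense+ReLU, then a linear head."""
    c_fused = np.concatenate([c_sparse, c_global, c_bahdanau])  # shape (6d,)
    z = relu(Wz @ c_fused + bz)                                 # dense projection
    return float(Wo @ z + bo)                                   # scalar y_hat

rng = np.random.default_rng(4)
d = 4                                    # GRU hidden size, so each context is 2d = 8
contexts = [rng.standard_normal(2 * d) for _ in range(3)]
Wz = rng.standard_normal((16, 6 * d))
Wo = rng.standard_normal(16)
y_hat = fuse_and_predict(*contexts, Wz, np.zeros(16), Wo, 0.0)
```

The concatenation preserves each attention head's contribution rather than averaging them, so the dense layer can learn how much each temporal scale should influence the prediction.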

3.7. Optimization

To train the proposed model, the Mean Squared Error (MSE) is adopted as the loss function, calculated as the average squared difference between the actual values and the predicted outputs:
\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
The parameters of the proposed model are optimized using the Adam optimizer, which offers efficient and stable convergence during training.

3.8. Rationale Behind the Proposed Architecture

Prior works on financial time-series forecasting utilizing single- or dual-attention mechanisms have not proven fully effective, as financial time series contain both short-term volatility and long-term cyclical behavior. The limitation of single-attention models is that they do not consider both temporal scales, over-emphasizing either local patterns or global context. Many studies utilizing dual-attention mechanisms improved the coverage but still fail to balance noise-driven short-term movements against structural long-term dependencies.
To address these limitations, BiG-FA integrates three complementary attention modules on top of a BiGRU backbone:
  • Top-k Sparse Attention is utilized to focus on only the most impactful time steps, effectively capturing short-term and high-impact fluctuations while avoiding noise in the time series.
  • Global Attention is utilized to learn long-term trends, cycles, and slow-moving dependencies.
  • Bahdanau Attention is utilized to provide context-specific weighting on time steps, offering fine-grained alignment between hidden states and the prediction target, thereby improving interpretability.
BiGRU further improves temporal modeling by capturing bidirectional dependencies. Overall, this triple-attention fusion on top of BiGRU enables the model to handle multi-scale temporal dynamics, avoid over-attention to noise or irrelevant time steps, preserve important long-range financial structure, and generate interpretable Attention Weights useful for XAI and financial reasoning.

4. Dataset Description and Preprocessing

This study uses daily closing price data from two well-known stock market indices that are highly uncertain, volatile, and non-linear in nature: the S&P 500 (representing the U.S. stock market) and the NIFTY 50 (representing the Indian stock market). The data was extracted using the yfinance library (Python 3.12.12) and the Yahoo Finance API for the period from 1 January 2015 to 31 December 2024.
The close price, which is the last traded price on a given trading day, was used as the sole feature for model training and forecasting. This price is commonly used in financial analysis and technical forecasting due to its stability and wide acceptance.
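For reproducibility, a hedged sketch of the data download step (assumes the yfinance package is installed and network access is available; `fetch_close_prices` is a hypothetical helper, though ^NSEI and ^GSPC are the standard Yahoo Finance symbols for these two indices):

```python
# Yahoo Finance ticker symbols for the two indices used in the study.
TICKERS = {"NIFTY 50": "^NSEI", "S&P 500": "^GSPC"}

def fetch_close_prices(start="2015-01-01", end="2024-12-31"):
    """Download daily closing prices for both indices over the study period."""
    import yfinance as yf  # imported lazily; assumes yfinance is installed
    return {name: yf.download(ticker, start=start, end=end)["Close"]
            for name, ticker in TICKERS.items()}
```

Each returned series contains only trading days, which is why the preprocessing step below still checks for and drops missing values.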

Data Preparation

To ensure efficient training and faster convergence of deep learning models, the raw data was preprocessed through the following steps:
  • Handling Missing Values: The datasets were carefully examined for null values. Any dates associated with missing closing prices were removed from the dataset to ensure data integrity.
  • Normalization: The Min-Max Normalization technique was employed to scale the closing prices to a range between 0 and 1. This transformation assists neural networks in learning efficiently by reducing the impact of differing price magnitudes across datasets.
  • Sequence Formation: A sliding window approach was used to transform the time series data into a supervised learning format. A sequence length of 30 days was chosen, meaning each input sample consists of the previous 30 days’ closing prices, and the corresponding target is the closing price of the next day.
  • Train–Test Split: The dataset was split into two parts: 80% for training and 20% for testing. The model was trained on the training set and evaluated on the unseen test set to assess its predictive performance.
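The steps above can be sketched end-to-end in plain Python on a synthetic price series (in the actual pipeline the closing prices come from yfinance and rows with missing prices are dropped first); the function names are illustrative:

```python
import math

def min_max_scale(prices):
    """Scale prices into [0, 1]; keep (lo, hi) to invert predictions later."""
    lo, hi = min(prices), max(prices)
    return [(p - lo) / (hi - lo) for p in prices], (lo, hi)

def make_windows(series, window=30):
    """Sliding window: 30 past closes as input, the next close as target."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return X, y

def train_test_split(X, y, train_frac=0.8):
    """Chronological 80/20 split (no shuffling for time series)."""
    n = int(len(X) * train_frac)
    return (X[:n], y[:n]), (X[n:], y[n:])

# Synthetic stand-in for a daily close-price series
prices = [100 + 0.1 * t + 5 * math.sin(t / 7) for t in range(500)]
scaled, (lo, hi) = min_max_scale(prices)
X, y = make_windows(scaled, window=30)
(X_tr, y_tr), (X_te, y_te) = train_test_split(X, y)
```

Note that the split is chronological rather than random, so the test set always lies strictly after the training period.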

5. Result Analysis

The results are captured and presented for both the NIFTY 50 and S&P 500 stock indices.

5.1. Performance on NIFTY 50 Index

The results of the proposed model on the NIFTY 50 stock index, along with comparisons against various attention-based deep learning models, are presented in Table 1 and illustrated in Figures 2–18. We employed four commonly used regression metrics for evaluation: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and the Coefficient of Determination (R2 Score).
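For reference, the four metrics can be written as short Python functions (a sketch of the standard definitions, not the authors' code):

```python
import math

def mae(y, yhat):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root Mean Squared Error (penalizes large deviations)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mape(y, yhat):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)

def r2_score(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only
y_true = [100.0, 102.0, 101.0, 105.0]
y_pred = [101.0, 101.0, 102.0, 104.0]
```

An R2 of 1 indicates a perfect fit, while a negative R2 (as reported later for the tree-based baselines) means the model predicts worse than the constant mean of the test targets.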
Experimental results of the proposed BiG-FA model across all evaluation metrics, as shown in Table 1, demonstrate that the model outperforms all other comparative models on this index. Specifically, it attained the highest R2 score of 0.9955 and the lowest error values: an MAE of 124.70, an RMSE of 173.34, and a MAPE of 0.58%. These results, as shown in Figure 2, affirm that Fused Attention Models are effective in modeling relationships among time steps across the entire sequence and in extracting context-aware information embedded in financial time series.
The BiGRU + Bahdanau Attention and BiLSTM + Bahdanau Attention models as shown in Figure 3 and Figure 4 also demonstrated strong performance, with results close to those of the Fused Attention Model. This suggests that bidirectional recurrent networks, when combined with additive attention mechanisms, are well-suited for modeling sequential financial data.
Beyond the bidirectional models, attention-enhanced recurrent architectures delivered the next-strongest predictive performance on the NIFTY 50 index. GRU + Luong Attention was the best of these, achieving an MAE of 209.74, an RMSE of 271.52, and an R2 score of 0.9890. N-BEATS followed closely with an MAE of 270.86, an RMSE of 341.48, and an R2 score of 0.9826. Other GRU-based variants such as GRU + Bahdanau Attention and the standalone GRU model also showed high predictive strength, with R2 scores above 0.974 and relatively low error values.
Attention-augmented LSTM models also achieved competitive accuracy. LSTM + Bahdanau Attention produced an MAE of 405.53, an RMSE of 493.85, and an R2 score of 0.9612. GRU + Vanilla Attention and DeepAR achieved R2 values between 0.90 and 0.94, with MAE values of 631.98 and 662.39, respectively. The performance of LSTM + Luong Attention and LSTM + Vanilla Attention degraded further, with lower R2 values of 0.8732 and 0.8518, respectively.
Models lacking strong attention mechanisms or hierarchical temporal modeling showed more limited predictive capability. The TCN model achieved an R2 score of 0.6728 with an MAE of 1198.30, while the standard LSTM model reached an R2 score of 0.6518 with an MAE of 1472.77. The CNN + LSTM + Attention architecture performed even worse, obtaining an R2 score of 0.4692 and an MAE of 1782.56, indicating that convolutional feature extraction did not meaningfully enhance forecasting accuracy for this index.
The weakest outcomes were observed in non-sequential machine learning models. Random Forest produced an MAE of 2855.10 and an RMSE of 3709.02, along with a negative R2 score of −1.0560. XGBoost performed similarly poorly with an MAE of 3010.03, an RMSE of 3861.51, and a negative R2 score of −1.2285. These results clearly demonstrate that tree-based ensemble methods are unable to model the sequential and long-range temporal dependencies present in financial time series.
In conclusion, the results on the NIFTY 50 index strongly support the effectiveness of the proposed Fused Attention approach. By leveraging the complementary strengths of multiple attention mechanisms, the model demonstrates enhanced learning and generalization capabilities, ultimately leading to more accurate and robust stock price predictions.

5.2. Performance Comparison on S&P 500 Stock Index

A thorough performance comparison of the proposed Fused Attention Model against a number of baseline and hybrid deep learning models was carried out using four common regression metrics: MAE, RMSE, MAPE, and R2 Score. The results of the BiG-FA model, along with comparisons against the other models used in this study, are presented in Table 2 and illustrated in Figures 19–35.
As shown in the results, the BiGRU + Fused Attention model achieved the highest R2 of 0.9961, an MAE of 30.13, an RMSE of 40.29, and a MAPE of 0.63%. This indicates its superior ability to model the S&P 500’s underlying dynamics compared to the other baselines. We also observe that bidirectional architectures with attention achieved the most competitive R2 values, with both the BiGRU + Bahdanau and BiLSTM + Bahdanau Attention models reaching an R2 of 0.9892.
Furthermore, attention-enhanced recurrent models demonstrated superior predictive capability among the remaining approaches. GRU + Luong Attention produced the best of these results, with an MAE of 77.86, an RMSE of 111.72, and an R2 score of 0.9699. DeepAR achieved an R2 of 0.9631, showing its predictive power on stock indices. The standalone GRU model, without any attention mechanism, remained competitive with an R2 of 0.9597. These results reaffirm the capability of gated recurrent structures for modeling the S&P 500 stock index.
Attention-augmented LSTM models showed a decline across all performance metrics, though their R2 values remained relatively high. For example, LSTM + Bahdanau Attention produced an R2 of 0.9533, while LSTM + Luong Attention, GRU + Vanilla Attention, and LSTM + Vanilla Attention exhibited further declines in R2. This indicates that LSTM models augmented with attention mechanisms were less effective at capturing market variability than their counterparts.
Interestingly, error rates increased further for hierarchical temporal models and for models lacking robust attention mechanisms. The CNN + LSTM + Attention model achieved an R2 score of only 0.7651, while TCN, the standard LSTM, and N-BEATS performed similarly, with R2 scores below 0.75. These results highlight the limitations of these architectures in predicting the future price of the S&P 500 stock index.
We also observed that the lowest performance was recorded by non-sequential machine learning models. The R2 value of Random Forest and XGBoost even went below 0.30. These results clearly show that tree-based ensemble methods are ineffective in modeling sequential data. Overall, the results on the S&P 500 index further highlight the robustness and adaptability of our Fused Attention framework for modeling complex financial time series.
In conclusion, the proposed BiGRU + Fused Attention model effectively integrates the strengths of Bahdanau, Global, and Sparse Attention mechanisms, while benefiting from the bidirectional capabilities of GRUs. This architectural fusion provides superior generalization ability, enabling the model to reliably capture the underlying dynamics in the S&P 500 stock index.

5.3. Regime-Based Stress Testing and Crisis Generalization

In this study, we performed an empirical evaluation of the models under different market stresses. Two regime-based stress tests—model performance during the COVID-19 crash and during the Global Financial Crisis—were conducted to assess generalization. For both the S&P 500 and NIFTY 50 indices, data spanning 2016 to 2019, a relatively stable period, was used to train the proposed BiG-FA model, and the COVID-19 crash window (February–July 2020) was used to evaluate its generalization. This confirms the model’s ability to perform during a sudden crisis in the financial market. Similarly, for the Global Financial Crisis experiments, data from 2003 to 2009 was considered for both indices and divided into a pre-crisis period (2003–2006) and a turbulence period (2007–2009). The pre-crisis data was used for training and the turbulence data for testing, allowing us to analyze performance during structural shifts.
The performance of the proposed model during the crisis periods is shown in Table 3. The results show that the proposed BiGRU–Fused Attention model generalizes well across both indices. Despite the S&P 500 experiencing rapid daily swings and high volatility in several sessions during the COVID-19 crash, the model achieved an R2 value of 0.8167, explaining over 81% of the variance even under turbulence. Furthermore, the model achieved a higher R2 of 0.8711 on the NIFTY 50, even though the index oscillated sharply after the COVID-19 crash. Overall, the proposed model demonstrates stability across indices during the COVID-19 crash period.
On the Global Financial Crisis data, which features prolonged downward trends, the model demonstrated even better results. It achieved an R2 of 0.9734 on the S&P 500, capturing more than 97% of the variance, and a similarly excellent R2 of 0.9731 on the NIFTY 50. Despite the smoother downward drift of this data, the model maintained low percentage errors across all metrics. In conclusion, the proposed Fused Attention Model explains near-perfect variance and retains predictive stability during highly chaotic periods in financial markets.

6. Performance Evaluation on U.S. Stocks

To assess the generalizability of the proposed forecasting framework, we evaluated all seventeen models individually across five U.S. equities: Apple (AAPL), Microsoft (MSFT), Amazon (AMZN), Alphabet (GOOGL), and Meta (META). These equities were chosen for their strong representation in major U.S. indices (Nasdaq-100, S&P 500), their distinct volatility and price-behavior patterns, and their frequent appearance in the financial forecasting literature. Model performance was analyzed using four metrics: MAE for absolute error, RMSE for penalizing large deviations, MAPE for percentage-based error, and R2 as the variance-explained measure.

6.1. Performance Comparison of Models on AAPL Stock Close Price Prediction

The BiGRU + Fused Attention model, as shown in Table 4, had the best overall performance for AAPL, with the highest R2 value of 0.9807, demonstrating its superior ability to capture short- and long-term patterns in AAPL stock. The BiGRU + Bahdanau Attention model came in second with an R2 of 0.9660, and the N-BEATS architecture also performed well with an R2 of 0.9651. GRU + Bahdanau Attention demonstrated moderate forecasting accuracy, while LSTM-based models, CNN–LSTM hybrids, Random Forest, and XGBoost failed to match this performance on AAPL stock price prediction.

6.2. Performance Comparison of Models on MSFT Stock Close Price Prediction

As shown in Table 5, the BiGRU + Fused Attention model performed best among all models on MSFT equity price forecasting, with an R2 value of 0.9612. The BiGRU + Bahdanau Attention model was next best, recording an R2 of 0.9589. Hybrid GRU- and LSTM-based attention models demonstrated moderate results, with RMSE values between 18 and 32, but still lagged behind the BiGRU-based models. Models such as DeepAR, TCN, and the vanilla LSTM showed more noticeable performance drops. Among all models, XGBoost performed poorest, with negative R2 values, showing that the ensemble tree model cannot capture temporal dependencies or model the dynamic structure of MSFT effectively because it treats all time steps as independent.

6.3. Performance Comparison of Models on AMZN Stock Close Price Prediction

As reported in Table 6, the BiGRU + Fused Attention model produced the strongest forecasting performance for AMZN, achieving an R2 value of 0.9894. The DeepAR model ranked second, yielding the next-best MAE of 4.30 and an RMSE of 5.24. The N-BEATS model also performed well on AMZN, with an R2 score of 0.9669. In contrast, BiGRU + Bahdanau Attention and GRU + Luong Attention underperformed relative to the top model despite their strong showing on MSFT, maintaining RMSE values between 7.23 and 10.48. Models such as the vanilla LSTM, TCN, and GRU + Vanilla Attention incurred higher error rates, and the results of traditional machine learning models such as Random Forest and XGBoost indicate limited predictive capability for AMZN’s highly volatile price dynamics.

6.4. Performance Comparison of Models on GOOGL Stock Close Price Prediction

As presented in Table 7, the BiGRU + Fused Attention model achieved the best performance for GOOGL, with an R2 value of 0.9885. Bidirectional variants with Bahdanau Attention exhibited strong forecasting capability, and GRU-based hybrid models performed reasonably, though with weaker predictive accuracy than the top-performing models. Substantial performance reductions for DeepAR, N-BEATS, and the vanilla LSTM show that these models struggle to capture the underlying structure of GOOGL’s prices. Both tree-based models have R2 values barely above 0.62, confirming that they fail to capture temporal dependencies in GOOGL price data.

6.5. Performance Comparison of Models on META Stock Close Price Prediction

As shown in Table 8, the BiGRU + Bahdanau Attention model delivered the strongest forecasting performance for the META stock price prediction with an R2 value of 0.9891. The BiGRU + Fused Attention model followed closely, with an R2 of 0.9875. However, the performance declined for LSTM- and GRU-based variants as well as the TCN model, with RMSE values exceeding 39. This confirms that Bidirectional GRU combined with attention captures temporal dependencies in META prices much better than unidirectional GRU and LSTM. XGBoost and Random Forest have the highest MAE and lowest R2, showing poor fit compared with recurrent and attention-based architectures.

6.6. Comparative Discussion Across All Stocks

Across all the U.S. equities, we observed a consistent pattern: bidirectional recurrent models with attention augmentation outperformed all other tested models. Specifically, our proposed model, which combines a BiGRU with three different attention mechanisms, achieved lower error rates and higher R2 values than all counterparts. We can thus conclude that attention fusion is clearly more effective than single-attention deep learning models, standard recurrent models, and tree-based models, owing to its ability to learn across different temporal scales.
Furthermore, modern deep learning models such as N-BEATS and DeepAR demonstrated competitive performance on certain stocks, but their performance was inconsistent and they struggled to generalize on others. Traditional machine learning models showed higher error rates, revealing their limitations in modeling non-linear, highly dynamic financial time-series data. Overall, the benefits of augmenting three different attention mechanisms on top of the BiGRU deep model underscore the fused model’s capability to handle diverse market conditions and stock characteristics. The model’s consistent performance highlights its efficacy for practical deployment in short-term stock price forecasting.

7. Explainability of the Proposed BiG-FA Model

In high-stakes domains such as financial forecasting, it is crucial not only to produce accurate predictions but also to offer interpretability for informed decision-making. To enhance the transparency of our proposed Fused Attention Model, we employed three prominent explainability techniques: Integrated Gradients, Attention Weight Analysis, and SHAP. Figures 36–41 illustrate how each technique provides a unique perspective on feature importance and temporal sensitivity.

7.1. Attention-Based Interpretability of Price Sequences

Forecasts produced by the proposed BiG-FA model effectively captured both short- and long-term temporal patterns. The model outperformed all other baseline and attention-based models, achieving MAE = 124.70, RMSE = 173.34, MAPE = 0.58%, and an R2 score of 0.9955.
We extracted attention weights from the Top-k Sparse, Global, and Bahdanau Attention mechanisms applied to a representative 30-day input sequence from both the NIFTY 50 and S&P 500 indices. These visualizations help explain how the Fused Attention mechanisms contribute to the model’s decision-making process.

7.1.1. Top-k Sparse Attention

The Top-k Sparse Attention mechanism selectively focuses on recent time steps, assigning high weights only to a few recent days. From the attention plot, we observe that it activates primarily during the last five time steps, ignoring older prices. This sharp focus mirrors trading heuristics that suggest recent price movements contain more relevant predictive signals, capturing strong short-term temporal dependencies.
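A minimal sketch of this behavior, assuming hypothetical recency-biased attention scores over a 30-day window, shows how Top-k selection zeroes out all but the most recent steps:

```python
import math

def topk_softmax(scores, k):
    """Softmax restricted to the k largest scores; all other weights are zero."""
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return [exps.get(i, 0.0) / z for i in range(len(scores))]

# Hypothetical scores that grow with recency: later days score higher
scores = [0.1 * t for t in range(30)]
weights = topk_softmax(scores, k=5)
```

With k = 5 and recency-increasing scores, only the last five positions receive non-zero weight, mirroring the behavior observed in the attention plot.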

7.1.2. Global Attention

Global Attention applies a softmax distribution over all time steps, allowing the model to attend to the entire sequence while prioritizing relevant portions. The plotted attention weights gradually increase toward the more recent days, indicating a preference for medium-term dependencies. This suggests that although historical data is not entirely disregarded, recent data points dominate the forecasting process.

7.1.3. Bahdanau Attention

Bahdanau Attention, an additive attention mechanism using learned queries and keys, displays highly selective and non-linear weighting. Its attention distribution strongly peaks toward specific recent time steps. This adaptive focusing bridges the gap between the rigid sparsity of Top-k Attention and the smoothness of Global Attention, resulting in a more flexible and context-aware temporal understanding.

7.2. Integrated Gradients Attribution

Figure 38 and Figure 39 present the attribution scores computed using the Integrated Gradients method. The results show a concentration of influence in the last three to five time steps, particularly timestep t29. Earlier time steps have near-zero contributions, reinforcing that the model predominantly relies on short-term dynamics in financial time series such as the NIFTY 50 and S&P 500 indices.
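Integrated Gradients attributes a prediction by integrating the gradient along a straight path from a baseline input to the actual input. The sketch below approximates that integral with a midpoint Riemann sum on a toy quadratic model standing in for the trained network; the function, baseline, and step count are illustrative assumptions:

```python
def integrated_gradients(f, grad_f, x, baseline, steps=1000):
    """Midpoint Riemann-sum approximation of IG along baseline -> x."""
    n = len(x)
    acc = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)  # for a real network: one backward pass
        for i in range(n):
            acc[i] += g[i]
    return [(xi - b) * gi / steps for xi, b, gi in zip(x, baseline, acc)]

# Toy stand-in for the model: f(x) = sum(x_i^2), with gradient 2x
f = lambda x: sum(v * v for v in x)
grad_f = lambda x: [2 * v for v in x]

x = [0.2, -0.5, 1.0]          # e.g., three recent normalized closes
baseline = [0.0, 0.0, 0.0]    # all-zero baseline input
attrib = integrated_gradients(f, grad_f, x, baseline)
```

A useful sanity check is the completeness axiom: the attributions sum to f(x) − f(baseline), so the per-timestep scores jointly account for the whole prediction.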

7.3. SHAP Value Interpretation

To further evaluate feature contributions, we applied SHAP (SHapley Additive exPlanations) across five test samples. The SHAP summary plots in Figure 40 and Figure 41 highlight the most recent five timesteps, t29 down to t25, as the most impactful features. The color intensity further shows that higher values at these time steps positively influence predictions. Early time steps (t0 to t19) contribute negligibly, confirming their low predictive relevance.
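SHAP values are Shapley values of a cooperative game over input features. For intuition, they can be computed exactly for a tiny surrogate model by enumerating all coalitions; in practice the SHAP library approximates this for deep networks, so the surrogate below is purely an illustrative assumption:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values for a small model: features outside the
    coalition S are replaced by their baseline value."""
    n = len(x)
    def value(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return model(z)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi += w * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return phis

# Toy surrogate: a weighted sum over three (hypothetical) recent time steps
surrogate = lambda z: 0.5 * z[0] + 0.3 * z[1] + 1.2 * z[2]
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(surrogate, x, baseline)
```

The efficiency property holds by construction: the values sum to the gap between the model output at x and at the baseline, which is what makes SHAP summary plots additive decompositions of each prediction.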

7.4. Interpretability Alignment and Insights

We compared gradient-based attribution (Integrated Gradients), model-agnostic SHAP values, and internal attention weights to assess the model’s transparency and consistency. All three methods consistently emphasize the final five to six time steps in the 30-day input sequence, demonstrating strong temporal alignment. This agreement across diverse interpretability techniques validates the robustness of the model’s learned temporal patterns.
The convergence of these explainability strategies—especially in complex financial datasets like the NIFTY 50 and S&P 500 indices—indicates that the proposed model effectively captures recent market behavior. This strengthens confidence in the model’s predictions and supports its deployment in real-world financial decision-making scenarios where both accuracy and interpretability are essential.

7.5. Coherence Between Performance and Attention Mechanisms

The high R2 values for the NIFTY 50 (0.9955) and S&P 500 (0.9961) indicate that the model is able to predict the close price with extremely small deviation, closely aligning with the actual market movements at each time step. These strong predictive correlations are a direct effect of utilizing the triple-attention fusion mechanism. The proposed triple-attention architecture, consisting of Top-k Sparse Attention, Global Attention, and Bahdanau Attention, is designed to help the model learn both short-term changes and long-term structures effectively. Top-k Sparse Attention is utilized to give higher importance to short-term market volatility and sudden movements that have a strong impact on next-day stock behavior. Global Attention complements this by preventing the influence of short-term noise and capturing long-term temporal patterns such as slow-moving upward or downward trends and periodic cycles. Moreover, Bahdanau Attention refines the weighting of temporal information based on the current forecasting requirement, providing a context-sensitive alignment between past hidden states and the target output.
The results of interpretability evaluation techniques, such as Integrated Gradients, SHAP values, and internal attention weights, show that the model focuses on the most recent and financially relevant time steps. The strong alignment between prediction performance and interpretability increases trust in the model. This alignment is not by chance or due to overfitting, but rather due to the model focusing on time steps that are influential and economically justified. The alignment of results between the triple-attention mechanism and the XAI findings also enhances the credibility of the proposed model. This alignment not only makes it suitable for deployment in real-world environments but also helps financial practitioners who require both accuracy and transparency.

7.6. Limitations and Future Work

Though our proposed BiG-FA model has shown consistent and robust performance across indices and individual U.S. stocks, it depends on a single feature, the close price, for learning. Another limitation is the omission of practical trading simulations. Future work may therefore explore incorporating exogenous variables such as macroeconomic indicators and market microstructure features to capture broader economic context. Such integration requires two changes. First, the input data format must be redesigned, for example with separate encoders for price and macro data to avoid misalignment. Second, it requires architectural modifications, such as multi-channel temporal encoders or feature-wise attention modules, to weight the influence of each feature. Future models could also use volume as an additional channel to enhance performance, though this requires careful handling of the temporal misalignment that arises from the different granularity levels of price and volume.
Another promising direction for future work is the integration of multimodal information, including embeddings of financial news, macroeconomic announcements, and stakeholder sentiment. Though this enriches the model’s feature space, combining such sources with numeric features requires specialized cross-modal attention to capture complex financial market behavior.
Finally, rigorous back-testing, transaction-cost modeling, and utilizing risk-adjusted performance metrics (e.g., Sharpe ratio, Sortino ratio, maximum drawdown) may be explored to evaluate the usefulness of the model’s predictions in real trading, not just how accurately they perform in terms of different statistical performance metrics. Addressing these methodological and practical considerations will be essential for the proposed model to be used in real-world applications and to transform it into a comprehensive, multimodal, and operational financial forecasting system.

8. Conclusions

In this study, we proposed a novel BiGRU + Fused Attention model, BiG-FA, that utilizes three complementary attention mechanisms—Top-k Sparse, Global, and Bahdanau Attention—to improve the predictive performance and interpretability of deep learning models in stock market forecasting. The proposed model was rigorously evaluated on two highly volatile and uncertain stock indices, NIFTY 50 and S&P 500, using multiple evaluation metrics such as the MAE, RMSE, MAPE, and R2 score.
The experimental results clearly show that the proposed Fused Attention Model significantly outperforms several baseline and hybrid deep learning models across all evaluation metrics. On the NIFTY 50 index, our proposed model yielded the highest R2 score of 0.9955 and the lowest error values, showcasing its robustness and ability to capture complex non-linear temporal dependencies. A similar pattern of superior performance was observed on the S&P 500 index, reinforcing the model’s generalizability across different financial markets.
In addition to prediction accuracy, strong emphasis was placed on the interpretability of the proposed BiG-FA model, a critical factor in financial applications and decision-making. By utilizing attention weight visualization, Integrated Gradients, and SHAP values, we provided deep insights into the black-box behavior of the model and the temporal contribution of each input time step in the sequence. All three interpretability methods consistently showed that the model focuses on the most recent time steps, aligning well with financial intuition and reinforcing the credibility of the predictions.
It is concluded that the convergence of the proposed attention-based model with gradient-based attribution and SHAP techniques in identifying feature importance across both stock indices affirms the model’s ability to provide transparent justifications and enhance interpretability. This level of explainability is essential for building trust among various stakeholders and facilitating informed decision-making in financial domains.
In conclusion, the proposed BiG-FA model offers a robust, scalable, and interpretable deep learning approach for time-series forecasting in the financial domain. It lays a strong foundation for future research on combining multiple attention strategies and highlights the importance of XAI in DL-based financial applications. As future work, we aim to extend the proposed model to other domains involving complex temporal relationships and to develop real-time adaptive trading systems.

Author Contributions

Conceptualization, R.R.K., R.P., S.K.N. and R.K.B.; methodology, R.R.K. and R.P.; software, R.R.K., R.P., S.K.N. and R.K.B.; validation, R.P.; formal analysis, R.R.K., S.K.N. and R.K.B.; investigation, R.R.K.; resources, R.P.; data curation, R.R.K., R.P., S.K.N. and R.K.B.; writing—original draft preparation, R.R.K., R.P., S.K.N. and R.K.B.; writing—review and editing, R.R.K.; visualization, R.R.K.; supervision, R.P. and S.K.N.; project administration, R.P. and S.K.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gandhmal, D.P.; Kumar, K. Systematic analysis and review of stock market prediction techniques. Comput. Sci. Rev. 2019, 34, 100190.
  2. Anagnostidis, P.; Varsakelis, C.; Emmanouilides, C.J. Has the 2008 financial crisis affected stock market efficiency? The case of Eurozone. Phys. A Stat. Mech. Its Appl. 2016, 447, 116–128.
  3. Jiang, W. Applications of deep learning in stock market prediction: Recent progress. Expert Syst. Appl. 2021, 184, 115537.
  4. Liu, J.; Li, H.; Hai, M.; Zhang, Y. A study of factors influencing financial stock prices based on causal inference. Procedia Comput. Sci. 2023, 221, 861–869.
  5. Bollen, J.; Mao, H.; Zeng, X. Twitter mood predicts the stock market. J. Comput. Sci. 2011, 2, 1–8.
  6. Korkusuz, B. Beyond the S&P 500: Examining the role of external volatilities in market forecasting. Rev. Econ. Des. 2024, 29, 767–794.
  7. Salisu, A.A.; Gupta, R.; Ogbonna, A.E. A moving average heterogeneous autoregressive model for forecasting the realized volatility of the US stock market: Evidence from over a century of data. Int. J. Financ. Econ. 2022, 27, 384–400.
  8. Bildirici, M.; Ersin, Ö.Ö. Improving forecasts of GARCH family models with the artificial neural networks: An application to the daily returns in Istanbul Stock Exchange. Expert Syst. Appl. 2009, 36, 7355–7362.
  9. Bollerslev, T.; Mikkelsen, H.O. Modeling and pricing long memory in stock market volatility. J. Econom. 1996, 73, 151–184.
  10. Qian, X.Y. Financial Series Prediction: Comparison Between Precision of Time Series Models and Machine Learning Methods. arXiv 2017, arXiv:1706.00948.
  11. Htun, H.H.; Biehl, M.; Petkov, N. Survey of feature selection and extraction techniques for stock market prediction. Financ. Innov. 2023, 9, 26.
  12. Liu, J.N.; Hu, Y. Application of feature-weighted Support Vector regression using grey correlation degree to stock price forecasting. Neural Comput. Appl. 2013, 22, 143–152.
  13. Qiu, M.; Song, Y. Predicting the Direction of Stock Market Index Movement Using an Optimized Artificial Neural Network Model. PLoS ONE 2016, 11, e0155133.
  14. Yang, H.L.; Lin, H.C. An Integrated Model Combined ARIMA, EMD with SVR for Stock Indices Forecasting. Int. J. Artif. Intell. Tools 2016, 25, 1650005.
  15. Chen, W.; Yeo, C.K.; Lau, C.T.; Lee, B.S. Leveraging social media news to predict stock index movement using RNN-boost. Data Knowl. Eng. 2018, 118, 14–24.
  16. Namdari, A.; Durrani, T.S. A Multilayer Feedforward Perceptron Model in Neural Networks for Predicting Stock Market Short-term Trends. Oper. Res. Forum 2021, 2, 38.
  17. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Comput. 2019, 31, 1235–1270.
  18. Shah, D.; Campbell, W.; Zulkernine, F.H. A Comparative Study of LSTM and DNN for Stock Market Forecasting. In Proceedings of the 2018 IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, 10–13 December 2018; pp. 4148–4155.
  19. Althelaya, K.A.; El-Alfy, E.S.M.; Mohammed, S. Evaluation of Bidirectional LSTM for Short and Long-Term Stock Market Prediction. In Proceedings of the 2018 9th International Conference on Information and Communication Systems, ICICS 2018, Irbid, Jordan, 3–5 April 2018; pp. 151–156.
  20. Gupta, U.; Bhattacharjee, V.; Bishnu, P.S. StockNet—GRU based stock index prediction. Expert Syst. Appl. 2022, 207, 117986.
  21. Zrira, N.; Kamal-Idrissi, A.; Farssi, R.; Khan, H.A. Time series prediction of sea surface temperature based on BiLSTM model with attention mechanism. J. Sea Res. 2024, 198, 102472.
  22. Jagadesh, B.N.; Reddy, N.V.R.; Udayaraju, P.; Damera, V.K.; Vatambeti, R.; Jagadeesh, M.S.; Koteswararao, C. Enhanced stock market forecasting using dandelion optimization-driven 3D-CNN-GRU classification. Sci. Rep. 2024, 14, 20908.
  23. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62.
  24. Babu, C.N.; Reddy, B.E. Prediction of selected Indian stock using a partitioning–interpolation based ARIMA–GARCH model. Appl. Comput. Inform. 2015, 11, 130–143.
  25. Kurani, A.; Doshi, P.; Vakharia, A.; Shah, M. A Comprehensive Comparative Study of Artificial Neural Network (ANN) and Support Vector Machines (SVM) on Stock Forecasting. Ann. Data Sci. 2023, 10, 183–208.
  26. Xiao, C.; Xia, W.; Jiang, J. Stock price forecast based on combined model of ARI-MA-LS-SVM. Neural Comput. Appl. 2020, 32, 5379–5388. [Google Scholar] [CrossRef]
  27. Nti, I.K.; Adekoya, A.F.; Weyori, B.A. A systematic review of fundamental and technical analysis of stock market predictions. Artif. Intell. Rev. 2019, 53, 3007–3057. [Google Scholar] [CrossRef]
  28. Nti, I.K.; Adekoya, A.F.; Weyori, B.A. A comprehensive evaluation of ensemble learning for stock-market prediction. J. Big Data 2020, 7, 20. [Google Scholar] [CrossRef]
  29. Zhang, Z.; Khushi, M. GA-MSSR: Genetic Algorithm Maximizing Sharpe and Sterling Ratio Method for RoboTrading. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar] [CrossRef]
  30. Tian, L.; Feng, L.; Yang, L.; Guo, Y. Stock price prediction based on LSTM and LightGBM hybrid model. J. Supercomput. 2022, 78, 11768–11793. [Google Scholar] [CrossRef]
  31. Pang, X.; Zhou, Y.; Wang, P.; Lin, W.; Chang, V. An innovative neural network approach for stock market prediction. J. Supercomput. 2020, 76, 2098–2118. [Google Scholar] [CrossRef]
  32. He, J.; Khushi, M.; Tran, N.H.; Liu, T. Robust Dual Recurrent Neural Networks for Financial Time Series Prediction. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), Alexandria, VA, USA, 29 April–1 May 2021; pp. 747–755. [Google Scholar] [CrossRef]
  33. Ding, G.; Qin, L. Study on the prediction of stock price based on the associated network model of LSTM. Int. J. Mach. Learn. Cybern. 2020, 11, 1307–1317. [Google Scholar] [CrossRef]
  34. Bhandari, H.N.; Rimal, B.; Pokhrel, N.R.; Rimal, R.; Dahal, K.R.; Khatri, R.K. Predicting stock market index using LSTM. Mach. Learn. Appl. 2022, 9, 100320. [Google Scholar] [CrossRef]
  35. Ghosh, P.; Neufeld, A.; Sahoo, J.K. Forecasting directional movements of stock prices for intraday trading using LSTM and random forests. Financ. Res. Lett. 2022, 46, 102280. [Google Scholar] [CrossRef]
  36. Touzani, Y.; Douzi, K. An LSTM and GRU based trading strategy adapted to the Moroccan market. J. Big Data 2021, 8, 126. [Google Scholar] [CrossRef]
  37. Fister, D.; Perc, M.; Jagrič, T. Two robust long short-term memory frameworks for trading stocks. Appl. Intell. 2021, 51, 7177–7195. [Google Scholar] [CrossRef] [PubMed]
  38. Hu, Z.; Zhao, Y.; Khushi, M. A Survey of Forex and Stock Price Prediction Using Deep Learning. Appl. Syst. Innov. 2021, 4, 9. [Google Scholar] [CrossRef]
  39. Lu, W.; Li, J.; Wang, J.; Qin, L. A CNN-BiLSTM-AM method for stock price prediction. Neural Comput. Appl. 2021, 33, 4741–4753. [Google Scholar] [CrossRef]
  40. Duan, Y.; Liu, Y.; Wang, Y.; Ren, S.; Wang, Y. Improved BIGRU Model and Its Application in Stock Price Forecasting. Electronics 2023, 12, 2718. [Google Scholar] [CrossRef]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009. [Google Scholar]
  42. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  43. Chen, S.; Ge, L. Exploring the attention mechanism in LSTM-based Hong Kong stock price movement prediction. Quant. Financ. 2019, 19, 1507–1515. [Google Scholar] [CrossRef]
  44. Liu, Y.; Gong, C.; Yang, L.; Chen, Y. DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Syst. Appl. 2020, 143, 113082. [Google Scholar] [CrossRef]
  45. Soydaner, D. Attention mechanism in neural networks: Where it comes and where it goes. Neural Comput. Appl. 2022, 34, 13371–13385. [Google Scholar] [CrossRef]
  46. Qiu, J.; Wang, B.; Zhou, C. Forecasting stock prices with long-short term memory neural network based on attention mechanism. PLoS ONE 2020, 15, e0227222. [Google Scholar] [CrossRef] [PubMed]
  47. Zhao, J.; Zeng, D.; Liang, S.; Kang, H.; Liu, Q. Prediction model for stock price trend based on recurrent neural network. J. Ambient Intell. Humaniz. Comput. 2021, 12, 745–753. [Google Scholar] [CrossRef]
  48. Zhang, J.; Ye, L.; Lai, Y. Stock Price Prediction Using CNN-BiLSTM-Attention Model. Mathematics 2023, 11, 1985. [Google Scholar] [CrossRef]
  49. Wang, Z.; Li, H.; Tang, Z.; Liu, Y. User-Level Ultra-Short-Term Load Forecasting Model Based on Optimal Feature Selection and Bahdanau Attention Mechanism. J. Circuits Syst. Comput. 2021, 30, 2150279. [Google Scholar] [CrossRef]
  50. Prakash, B.; Saleena, B. Stock Market Prediction Using Deep Attention Bi-directional Long Short-Term Memory. Comput. Econ. 2025, 66, 903–927. [Google Scholar] [CrossRef]
  51. Wang, C.; Chen, Y.; Zhang, S.; Zhang, Q. Stock market index prediction using deep Transformer model. Expert Syst. Appl. 2022, 208, 118128. [Google Scholar] [CrossRef]
  52. Wang, S. A Stock Price Prediction Method Based on BiLSTM and Improved Transformer. IEEE Access 2023, 11, 104211–104223. [Google Scholar] [CrossRef]
  53. Jaggi, M.; Mandal, P.; Narang, S.; Naseem, U.; Khushi, M. Text Mining of Stocktwits Data for Predicting Stock Prices. Appl. Syst. Innov. 2021, 4, 13. [Google Scholar] [CrossRef]
Figure 1. Block diagram of proposed deep Bidirectional GRU Fused Attention (BiG-FA) model.
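The fusion stage in Figure 1 combines three attention mechanisms over the BiGRU's hidden-state sequence. The sketch below is a minimal NumPy illustration of one plausible fusion scheme, concatenating the context vectors from Global, Top-k Sparse, and Bahdanau attention; the dimensions, random projection matrices, choice of query, and concatenation step are assumptions for illustration, not the paper's exact layer configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, d = 10, 8                         # time steps, hidden size (illustrative)
H = rng.normal(size=(T, d))          # stand-in for BiGRU hidden states
query = H[-1]                        # stand-in query: the last hidden state

# Global attention: softmax over dot-product scores for all time steps.
scores = H @ query
global_w = softmax(scores)

# Top-k sparse attention: keep only the k highest scores, zero the rest.
k = 3
masked = np.full(T, -np.inf)
top = np.argsort(scores)[-k:]
masked[top] = scores[top]
sparse_w = softmax(masked)           # exp(-inf) = 0, so only k weights survive

# Bahdanau (additive) attention with randomly initialised projections.
Wa = rng.normal(size=(d, d))
Ua = rng.normal(size=(d, d))
va = rng.normal(size=d)
bahdanau_w = softmax(np.tanh(H @ Wa + query @ Ua) @ va)

# Fuse: one context vector per mechanism, concatenated for the output layer.
fused = np.concatenate([w @ H for w in (global_w, sparse_w, bahdanau_w)])

assert fused.shape == (3 * d,)
assert np.count_nonzero(sparse_w) == k   # sparsity: only k steps attend
```

Each weight vector sums to one, so the three context vectors are comparable in scale before concatenation; other fusion choices (averaging or a learned gate) would be equally plausible.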
Figure 2. Forecast using the proposed Fused Attention Model (BiG-FA) on the NIFTY 50 stock index, capturing temporal dependencies at multiple scales. The proposed model outperformed all other baseline and attention-based models, achieving MAE = 124.70, RMSE = 173.34, MAPE = 0.58%, and R2 Score = 0.9955.
Figure 3. Forecast using BiGRU + Bahdanau Attention on NIFTY 50 stock index: yielded best results with MAE = 192.30, RMSE = 248.87, MAPE = 0.89%, and R2 = 0.9908.
Figure 4. Forecast using BiLSTM + Bahdanau Attention on NIFTY 50 stock index: yielded MAE = 198.80, RMSE = 250.38, MAPE = 0.91%, and R2 = 0.9906.
Figure 5. Forecast using GRU + Luong Attention on NIFTY 50 stock index: produced MAE = 209.74, RMSE = 271.52, MAPE = 0.94%, and R2 = 0.9890.
Figure 6. Forecast using N-BEATS on NIFTY 50 stock index: achieved MAE = 270.86, RMSE = 341.48, MAPE = 1.24%, and R2 = 0.9826.
Figure 7. Forecast using GRU + Bahdanau Attention on NIFTY 50 stock index: reported MAE = 300.00, RMSE = 401.01, MAPE = 1.36%, and R2 = 0.9760.
Figure 8. Forecast using GRU on NIFTY 50 stock index: scored MAE = 330.71, RMSE = 411.02, MAPE = 1.50%, and R2 = 0.9748.
Figure 9. Forecast using LSTM + Bahdanau Attention on NIFTY 50 stock index: registered MAE = 405.53, RMSE = 493.85, MAPE = 1.82%, and R2 = 0.9612.
Figure 10. Forecast using GRU + Vanilla Attention on NIFTY 50 stock index: resulted in MAE = 631.98, RMSE = 672.55, MAPE = 2.95%, and R2 = 0.9324.
Figure 11. Forecast using DeepAR on NIFTY 50 stock index: achieved MAE = 662.39, RMSE = 808.03, MAPE = 2.91%, and R2 = 0.9024.
Figure 12. Forecast using LSTM + Luong Attention on NIFTY 50 stock index: achieved MAE = 794.24, RMSE = 920.11, MAPE = 3.62%, and R2 = 0.8732.
Figure 13. Forecast using LSTM + Vanilla Attention on NIFTY 50 stock index: generated MAE = 854.19, RMSE = 995.65, MAPE = 3.86%, and R2 = 0.8518.
Figure 14. Forecast using TCN model on NIFTY 50 stock index: achieved MAE = 1198.30, RMSE = 1479.59, MAPE = 5.27%, and R2 = 0.6728.
Figure 15. Forecast using LSTM on NIFTY 50 stock index: produced poor results with MAE = 1472.77, RMSE = 1526.47, MAPE = 6.85%, and R2 = 0.6518.
Figure 16. Forecast using CNN + LSTM + Attention on NIFTY 50 stock index: recorded the lowest performance with MAE = 1782.56, RMSE = 1884.56, MAPE = 8.19%, and R2 = 0.4692.
Figure 17. Forecast using Random Forest on NIFTY 50 stock index: yielded MAE = 2855.10, RMSE = 3709.02, MAPE = 12.30%, and R2 = −1.0560.
Figure 18. Forecast using XGBoost Model on NIFTY 50 stock index: yielded MAE = 3010.03, RMSE = 3861.51, MAPE = 13.01%, and R2 = −1.2285.
Figure 19. Forecast using the proposed BiG-FA model on S&P 500 stock index: achieved the best overall performance with MAE = 30.13, RMSE = 40.29, MAPE = 0.63%, and R2 Score = 0.9961.
Figure 20. Forecast using BiGRU + Bahdanau Attention on S&P 500 stock index: yielded strong results with MAE = 55.42, RMSE = 64.90, MAPE = 1.12%, and R2 = 0.9899.
Figure 21. Forecast using BiLSTM + Bahdanau Attention on S&P 500 stock index: achieved MAE = 55.48, RMSE = 65.71, MAPE = 1.12%, and R2 = 0.9896.
Figure 22. Forecast using GRU + Bahdanau Attention on S&P 500 stock index: reported MAE = 66.13, RMSE = 74.73, MAPE = 1.36%, and R2 = 0.9866.
Figure 23. Forecast using GRU + Luong Attention on S&P 500 stock index: resulted in MAE = 77.86, RMSE = 111.72, MAPE = 1.48%, and R2 = 0.9699.
Figure 24. Forecast using DeepAR on S&P 500 stock index: achieved MAE = 99.17, RMSE = 123.80, MAPE = 1.92%, and R2 = 0.9631.
Figure 25. Forecast using GRU on S&P 500 stock index: produced MAE = 109.93, RMSE = 129.42, MAPE = 2.17%, and R2 = 0.9597.
Figure 26. Forecast using LSTM + Bahdanau Attention on S&P 500 stock index: registered MAE = 116.96, RMSE = 139.26, MAPE = 2.28%, and R2 = 0.9533.
Figure 27. Forecast using LSTM + Luong Attention on S&P 500 stock index: achieved MAE = 152.70, RMSE = 177.39, MAPE = 3.00%, and R2 = 0.9242.
Figure 28. Forecast using GRU + Vanilla Attention on S&P 500 stock index: produced MAE = 165.55, RMSE = 189.70, MAPE = 3.28%, and R2 = 0.9134.
Figure 29. Forecast using LSTM + Vanilla Attention on S&P 500 stock index: yielded MAE = 178.43, RMSE = 203.58, MAPE = 3.53%, and R2 = 0.9002.
Figure 30. Forecast using CNN + LSTM + Attention on S&P 500 stock index: resulted in relatively poor performance with MAE = 273.28, RMSE = 312.32, MAPE = 5.33%, and R2 = 0.7651.
Figure 31. Forecast using TCN on S&P 500 stock index: achieved MAE = 258.75, RMSE = 320.01, MAPE = 4.95%, and R2 = 0.7534.
Figure 32. Forecast using LSTM on S&P 500 stock index: recorded the lowest performance with MAE = 296.58, RMSE = 323.50, MAPE = 5.89%, and R2 = 0.7480.
Figure 33. Forecast using N-BEATS on S&P 500 stock index: achieved MAE = 306.75, RMSE = 388.32, MAPE = 5.80%, and R2 = 0.6369.
Figure 34. Forecast using Random Forest on S&P 500 stock index: yielded MAE = 360.41, RMSE = 549.23, MAPE = 6.52%, and R2 = 0.2737.
Figure 35. Forecast using XGBoost on S&P 500 stock index: yielded MAE = 400.13, RMSE = 598.67, MAPE = 7.26%, and R2 = 0.1370.
Figure 36. Visualization of Attention Weights—NIFTY 50 stock index.
Figure 37. Visualization of Attention Weights—S&P 500 stock index.
Figure 38. Integrated Gradient over Time Steps—NIFTY 50 stock index.
Figure 39. Integrated Gradient over Time Steps—S&P 500 stock index.
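Figures 38 and 39 report Integrated Gradients attributions over input time steps. The method is model-agnostic: attributions are the elementwise product of (input − baseline) with the path integral of gradients from the baseline to the input. A minimal sketch with a toy differentiable model, where the model, zero baseline, and step count are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=64):
    """Midpoint Riemann-sum approximation of Integrated Gradients:
    (x - baseline) * mean of grad f along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy differentiable model: f(x) = sum(w * x^2), so grad f = 2 * w * x.
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(np.sum(w * x ** 2))
grad_f = lambda x: 2.0 * w * x

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(grad_f, x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline).
assert abs(attr.sum() - (f(x) - f(baseline))) < 1e-9
```

For the BiG-FA model the same recipe would run the gradient of the predicted closing price with respect to each input window position, which is what the per-time-step curves in Figures 38 and 39 visualize.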
Figure 40. SHAP Summary Plot: Feature Impact on Prediction for NIFTY 50 index.
Figure 41. SHAP Summary Plot: Feature Impact on Prediction for S&P 500 index.
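The SHAP summary plots in Figures 40 and 41 rank features by their Shapley-value contributions to the prediction. In practice the paper's plots would come from a SHAP explainer over the trained network; the sketch below instead computes exact Shapley values by brute force for a tiny toy model, purely to show the quantity being plotted (the model, inputs, and baseline are illustrative assumptions):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for a small feature count.
    Features outside a coalition are set to their baseline value."""
    n = len(x)
    def value(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[i] += weight * (value(set(S) | {i}) - value(set(S)))
    return phi

# Toy model with an interaction between features 1 and 2.
f = lambda z: 2 * z[0] + z[1] * z[2]
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(f, x, baseline)

# Efficiency property: contributions sum to f(x) - f(baseline).
assert abs(sum(phi) - (f(x) - f(baseline))) < 1e-9
```

The interaction term is split evenly between the two features involved (phi[1] == phi[2] == 3.0 here), which is exactly the symmetric credit assignment a SHAP summary plot aggregates across samples. Exact enumeration is exponential in the feature count, which is why practical explainers approximate it.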
Table 1. Performance comparison of models on NIFTY 50 index close price prediction.

| Model | MAE | RMSE | MAPE | R2 Score |
|---|---|---|---|---|
| BiGRU + Fused Attention (Sparse + Global + Bahdanau) | 124.70 | 173.34 | 0.58% | 0.9955 |
| BiGRU + Bahdanau Attention | 192.30 | 248.87 | 0.89% | 0.9908 |
| BiLSTM + Bahdanau Attention | 198.80 | 250.38 | 0.91% | 0.9906 |
| GRU + Luong Attention | 209.74 | 271.52 | 0.94% | 0.9890 |
| N-BEATS | 270.86 | 341.48 | 1.24% | 0.9826 |
| GRU + Bahdanau Attention | 300.00 | 401.01 | 1.36% | 0.9760 |
| GRU | 330.71 | 411.02 | 1.50% | 0.9748 |
| LSTM + Bahdanau Attention | 405.53 | 493.85 | 1.82% | 0.9612 |
| GRU + Vanilla Attention | 631.98 | 672.55 | 2.95% | 0.9324 |
| DeepAR | 662.39 | 808.03 | 2.91% | 0.9024 |
| LSTM + Luong Attention | 794.24 | 920.11 | 3.62% | 0.8732 |
| LSTM + Vanilla Attention | 854.19 | 995.65 | 3.86% | 0.8518 |
| TCN | 1198.30 | 1479.59 | 5.27% | 0.6728 |
| LSTM | 1472.77 | 1526.47 | 6.85% | 0.6518 |
| CNN + LSTM + Attention | 1782.56 | 1884.56 | 8.19% | 0.4692 |
| Random Forest | 2855.10 | 3709.02 | 12.30% | −1.0560 |
| XGBoost | 3010.03 | 3861.51 | 13.01% | −1.2285 |
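The four metrics reported throughout Tables 1–8 can be reproduced from a list of predictions as follows (a standard implementation, not taken from the paper's code; note that R2 turns negative whenever a model underperforms the constant mean predictor, as for Random Forest and XGBoost above):

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, MAPE (%) and R2 Score on a held-out test set."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot       # negative if ss_res > ss_tot
    return mae, rmse, mape, r2

# Toy closing prices vs. predictions (hypothetical values).
mae, rmse, mape, r2 = regression_metrics(
    [100.0, 102.0, 101.0, 105.0],
    [101.0, 101.0, 102.0, 104.0])
```

On this toy series every prediction is off by 1, so MAE = RMSE = 1.0 and MAPE is just under 1%.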
Table 2. Performance comparison of models on S&P 500 index close price prediction.

| Model | MAE | RMSE | MAPE | R2 Score |
|---|---|---|---|---|
| BiGRU + Fused Attention (Sparse + Global + Bahdanau) | 30.13 | 40.29 | 0.63% | 0.9961 |
| BiGRU + Bahdanau Attention | 55.42 | 64.90 | 1.12% | 0.9899 |
| BiLSTM + Bahdanau Attention | 55.48 | 65.71 | 1.12% | 0.9896 |
| GRU + Bahdanau Attention | 66.13 | 74.73 | 1.36% | 0.9866 |
| GRU + Luong Attention | 77.86 | 111.72 | 1.48% | 0.9699 |
| DeepAR | 99.17 | 123.80 | 1.92% | 0.9631 |
| GRU | 109.93 | 129.42 | 2.17% | 0.9597 |
| LSTM + Bahdanau Attention | 116.96 | 139.26 | 2.28% | 0.9533 |
| LSTM + Luong Attention | 152.70 | 177.39 | 3.00% | 0.9242 |
| GRU + Vanilla Attention | 165.55 | 189.70 | 3.28% | 0.9134 |
| LSTM + Vanilla Attention | 178.43 | 203.58 | 3.53% | 0.9002 |
| CNN + LSTM + Attention | 273.28 | 312.32 | 5.33% | 0.7651 |
| TCN | 258.75 | 320.01 | 4.95% | 0.7534 |
| LSTM | 296.58 | 323.50 | 5.89% | 0.7480 |
| N-BEATS | 306.75 | 388.32 | 5.80% | 0.6369 |
| Random Forest | 360.41 | 549.23 | 6.52% | 0.2737 |
| XGBoost | 400.13 | 598.67 | 7.26% | 0.1370 |
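The abstract specifies that 80% of the historical price data is used for learning and 20% for testing. For time series this split must be chronological, with sliding windows built within each segment so test-period values never leak into training samples. A sketch of that setup (the lookback length and per-segment windowing are assumptions; the paper states only the 80/20 ratio):

```python
def make_windows(series, lookback, train_frac=0.8):
    """Chronological 80/20 split, then sliding-window samples
    (X = lookback consecutive prices, y = the next price)
    built separately within each segment to avoid leakage."""
    cut = int(len(series) * train_frac)

    def windows(seg):
        X = [seg[i:i + lookback] for i in range(len(seg) - lookback)]
        y = [seg[i + lookback] for i in range(len(seg) - lookback)]
        return X, y

    return windows(series[:cut]), windows(series[cut:])

# Toy closing-price series of 20 days with a 3-day lookback.
prices = [100.0 + i for i in range(20)]
(train_X, train_y), (test_X, test_y) = make_windows(prices, lookback=3)
```

One design consequence: windowing each segment separately discards the first `lookback` test days as targets; an alternative is to prepend the last `lookback` training prices to the test segment as context only.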
Table 3. Performance of the proposed BiGRU–Fused Attention model under crisis regimes for S&P 500 and NIFTY 50.

| Experiment | MAE | RMSE | MAPE (%) | R2 |
|---|---|---|---|---|
| S&P 500–COVID-19 Crisis | 95.0442 | 114.2608 | 3.29 | 0.8167 |
| S&P 500–Global Financial Crisis | 31.4984 | 43.6248 | 3.08 | 0.9734 |
| NIFTY 50–COVID-19 Crisis | 317.4508 | 414.1400 | 3.30 | 0.8711 |
| NIFTY 50–Global Financial Crisis | 31.6047 | 43.8923 | 3.15 | 0.9731 |
Table 4. Performance comparison of models on AAPL close price prediction.

| Model | MAE | RMSE | MAPE | R2 Score |
|---|---|---|---|---|
| BiGRU + Fused Attention | 3.03 | 3.82 | 1.55% | 0.9807 |
| BiGRU + Bahdanau Attention | 3.81 | 5.07 | 1.90% | 0.9660 |
| N-BEATS | 4.14 | 5.14 | 2.22% | 0.9651 |
| GRU + Bahdanau Attention | 3.87 | 5.22 | 1.95% | 0.9640 |
| BiLSTM + Bahdanau Attention | 5.32 | 6.67 | 2.85% | 0.9412 |
| GRU + Vanilla Attention | 6.33 | 8.44 | 3.16% | 0.9058 |
| GRU + Luong Attention | 6.84 | 8.52 | 3.70% | 0.9041 |
| DeepAR | 6.83 | 8.54 | 3.61% | 0.9036 |
| TCN | 7.47 | 9.27 | 3.85% | 0.8865 |
| LSTM + Luong Attention | 8.17 | 10.19 | 4.19% | 0.8628 |
| GRU | 8.40 | 10.32 | 4.32% | 0.8592 |
| LSTM + Bahdanau Attention | 8.57 | 10.72 | 4.20% | 0.8481 |
| LSTM + Vanilla Attention | 11.06 | 11.97 | 5.74% | 0.8105 |
| LSTM | 10.14 | 12.59 | 5.10% | 0.7903 |
| CNN + LSTM + Attention | 11.39 | 14.78 | 5.72% | 0.7113 |
| Random Forest | 20.62 | 30.51 | 9.61% | −0.2308 |
| XGBoost | 22.98 | 32.94 | 10.79% | −0.4345 |
Table 5. Performance comparison of models on MSFT close price prediction.

| Model | MAE | RMSE | MAPE | R2 Score |
|---|---|---|---|---|
| BiGRU + Fused Attention | 10.08 | 12.08 | 2.77% | 0.9612 |
| BiGRU + Bahdanau Attention | 10.70 | 12.43 | 2.80% | 0.9589 |
| GRU + Luong Attention | 14.91 | 18.34 | 4.17% | 0.9106 |
| BiLSTM + Bahdanau Attention | 15.41 | 18.34 | 4.11% | 0.9106 |
| GRU + Bahdanau Attention | 17.63 | 20.60 | 4.76% | 0.8871 |
| GRU | 23.46 | 27.94 | 6.17% | 0.7925 |
| N-BEATS | 24.20 | 28.08 | 6.21% | 0.7904 |
| DeepAR | 25.32 | 28.57 | 6.53% | 0.7829 |
| LSTM + Luong Attention | 24.60 | 29.20 | 6.53% | 0.7733 |
| LSTM + Vanilla Attention | 25.63 | 31.24 | 6.71% | 0.7406 |
| GRU + Vanilla Attention | 26.15 | 32.16 | 6.72% | 0.7250 |
| LSTM + Bahdanau Attention | 26.78 | 32.34 | 6.91% | 0.7219 |
| TCN | 30.24 | 35.54 | 7.78% | 0.6641 |
| LSTM | 31.06 | 37.59 | 7.96% | 0.6242 |
| CNN + LSTM + Attention | 49.94 | 58.77 | 12.74% | 0.0816 |
| Random Forest | 52.54 | 68.74 | 12.86% | −0.2564 |
| XGBoost | 59.06 | 76.55 | 14.48% | −0.5582 |
Table 6. Performance comparison of models on AMZN close price prediction.

| Model | MAE | RMSE | MAPE | R2 Score |
|---|---|---|---|---|
| BiGRU + Fused Attention | 2.87 | 3.71 | 1.85% | 0.9894 |
| DeepAR | 4.30 | 5.24 | 3.08% | 0.9788 |
| N-BEATS | 5.53 | 6.55 | 4.09% | 0.9669 |
| BiGRU + Bahdanau Attention | 5.88 | 7.23 | 3.91% | 0.9597 |
| GRU + Luong Attention | 8.21 | 10.48 | 5.95% | 0.9152 |
| GRU + Bahdanau Attention | 6.13 | 11.49 | 3.40% | 0.8982 |
| BiLSTM + Bahdanau Attention | 9.27 | 11.65 | 5.66% | 0.8953 |
| LSTM + Luong Attention | 9.12 | 11.68 | 6.37% | 0.8948 |
| LSTM + Vanilla Attention | 9.71 | 12.37 | 6.73% | 0.8819 |
| LSTM + Bahdanau Attention | 10.07 | 12.88 | 6.87% | 0.8721 |
| GRU | 11.81 | 13.34 | 8.50% | 0.8628 |
| TCN | 11.15 | 13.71 | 7.65% | 0.8551 |
| LSTM | 11.69 | 13.79 | 8.76% | 0.8532 |
| XGBoost | 9.54 | 16.68 | 5.16% | 0.7854 |
| Random Forest | 13.87 | 17.65 | 9.05% | 0.7597 |
| GRU + Vanilla Attention | 13.11 | 17.73 | 8.12% | 0.7576 |
| CNN + LSTM + Attention | 16.70 | 19.86 | 11.43% | 0.6959 |
Table 7. Performance comparison of models on GOOGL close price prediction.

| Model | MAE | RMSE | MAPE | R2 Score |
|---|---|---|---|---|
| BiGRU + Fused Attention | 2.14 | 2.90 | 1.56% | 0.9885 |
| BiLSTM + Bahdanau Attention | 3.48 | 4.32 | 2.71% | 0.9745 |
| BiGRU + Bahdanau Attention | 3.59 | 4.71 | 2.61% | 0.9696 |
| GRU + Bahdanau Attention | 6.03 | 7.31 | 4.89% | 0.9269 |
| GRU + Luong Attention | 6.03 | 7.68 | 4.28% | 0.9193 |
| GRU | 6.31 | 7.72 | 4.54% | 0.9184 |
| LSTM + Luong Attention | 7.57 | 9.37 | 5.60% | 0.8799 |
| TCN | 8.70 | 10.34 | 6.87% | 0.8537 |
| LSTM + Vanilla Attention | 8.80 | 10.79 | 6.50% | 0.8406 |
| DeepAR | 11.46 | 12.72 | 9.19% | 0.7786 |
| LSTM + Bahdanau Attention | 10.00 | 13.18 | 6.85% | 0.7621 |
| N-BEATS | 10.95 | 13.36 | 7.91% | 0.7556 |
| LSTM | 11.06 | 13.86 | 7.85% | 0.7371 |
| GRU + Vanilla Attention | 11.61 | 13.93 | 8.50% | 0.7345 |
| CNN + LSTM + Attention | 11.57 | 15.05 | 8.00% | 0.6898 |
| XGBoost | 10.45 | 16.33 | 6.35% | 0.6349 |
| Random Forest | 10.51 | 16.60 | 6.36% | 0.6229 |
Table 8. Performance comparison of models on META close price prediction.

| Model | MAE | RMSE | MAPE | R2 Score |
|---|---|---|---|---|
| BiGRU + Bahdanau Attention | 11.04 | 14.28 | 3.06% | 0.9891 |
| BiGRU + Fused Attention | 11.83 | 15.34 | 3.13% | 0.9875 |
| BiLSTM + Bahdanau Attention | 20.93 | 26.58 | 5.23% | 0.9623 |
| GRU | 23.22 | 27.63 | 6.39% | 0.9593 |
| N-BEATS | 22.77 | 29.67 | 5.48% | 0.9531 |
| GRU + Bahdanau Attention | 24.74 | 30.61 | 6.43% | 0.9500 |
| DeepAR | 28.39 | 32.73 | 8.58% | 0.9429 |
| TCN | 29.21 | 39.18 | 6.64% | 0.9182 |
| GRU + Luong Attention | 29.31 | 39.24 | 6.73% | 0.9179 |
| GRU + Vanilla Attention | 31.86 | 42.47 | 7.78% | 0.9038 |
| LSTM + Luong Attention | 37.47 | 50.13 | 8.55% | 0.8660 |
| LSTM | 37.51 | 52.39 | 8.88% | 0.8537 |
| LSTM + Bahdanau Attention | 47.54 | 62.04 | 11.25% | 0.7948 |
| LSTM + Vanilla Attention | 61.27 | 83.64 | 13.53% | 0.6270 |
| CNN + LSTM + Attention | 70.26 | 99.25 | 14.86% | 0.4748 |
| Random Forest | 70.99 | 105.70 | 13.89% | 0.4044 |
| XGBoost | 78.73 | 116.26 | 15.44% | 0.2794 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Khansama, R.R.; Priyadarshini, R.; Nanda, S.K.; Barik, R.K. Capturing Short- and Long-Term Temporal Dependencies Using Bahdanau-Enhanced Fused Attention Model for Financial Data—An Explainable AI Approach. FinTech 2026, 5, 4. https://doi.org/10.3390/fintech5010004