3.1. The LSTM Model
An LSTM consists of three layers: input, hidden, and output. In contrast to standard feedforward neural networks, the neurons in the LSTM hidden layer are connected to one another. This recurrence lets the model relate successive data points and capture intricate patterns and dependencies within the input sequence, but it also raises the challenge of preserving relevant information over long time spans. To tackle this challenge, LSTMs employ three essential gates: a forgetting gate, an input gate, and an output gate. These gates manage the flow of information, ensuring that vital long-term information is retained while irrelevant input data are discarded. Visually, the LSTM model resembles a chain-like structure, as illustrated in
Figure 1, comprising repeating basic units known as memory cells.
The input gate $i_t$ controls the flow of input data into the cell state, deciding what information needs updating and what is essential for accurate predictions. Meanwhile, the forgetting gate $f_t$ filters out unnecessary information from the previous step and updates the cell state accordingly. Lastly, the output gate $o_t$ determines the output value to be transmitted to the next hidden cell. The equations governing the gates of an LSTM are:
$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t * c_{t-1} + i_t * g_t \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
$$
Here, $h_t$ is the hidden state at time $t$ and represents the neuron’s “memory” at that moment. The input at time $t$ is $x_t$, while $h_{t-1}$ is either the hidden state at time $t-1$ or the initial hidden state at time 0. The input, forgetting, unit, and output gates are denoted by $i_t$, $f_t$, $g_t$, and $o_t$, respectively; $c_t$ is the cell state, and $W$, $U$, and $b$ are the corresponding weight matrices and bias vectors. Here, $\sigma$ represents the sigmoid function, and $*$ signifies element-wise vector multiplication.
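To make the cell update concrete, the following is a minimal NumPy sketch of a single LSTM step implementing the equations above; the parameter names (W_f, U_f, b_f, and so on) and the dictionary-based parameter container are illustrative assumptions, not the exact parameterization of our model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell step following the gate equations above.

    params holds weight matrices W_*, U_* and bias vectors b_* for the
    forgetting (f), input (i), unit (g), and output (o) gates (illustrative names).
    """
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])  # forgetting gate
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])  # input gate
    g_t = np.tanh(params["W_g"] @ x_t + params["U_g"] @ h_prev + params["b_g"])  # unit (candidate) gate
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])  # output gate
    c_t = f_t * c_prev + i_t * g_t          # update the cell state
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t
```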
3.2. Grey Wolf Optimization
The GWO algorithm is inspired by the hunting tactics of grey wolves. It approximates solutions for optimization and search problems by mimicking the wolves’ behaviors of encircling, chasing, and attacking their prey. The algorithm simulates the collaborative and competitive behaviors of wolf packs during hunting, iteratively refining candidate solutions. It comprises three core phases: encircling, hunting, and attacking.
(1) Encircling Prey: Grey wolves slowly close in on their prey by encircling it, and this behavior is mathematically modeled as follows:
$$
\begin{aligned}
D &= \lvert C \cdot X_p(t) - X(t) \rvert \\
X(t+1) &= X_p(t) - A \cdot D \\
A &= 2a \cdot r_1 - a \\
C &= 2 \cdot r_2
\end{aligned}
$$
In this context, $X_p$ denotes the prey’s position, while $X$ is the position vector of the grey wolf. The current iteration number is $t$. Coefficient vectors $A$ and $C$ are used, along with random vectors $r_1$ and $r_2$ ranging between 0 and 1. Throughout the iterations, the elements of the vector $a$ linearly decrease from 2 to 0.
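As an illustration, the following minimal NumPy sketch performs one encircling update for a single wolf, assuming a known prey position; the values in the example call are placeholders.

```python
import numpy as np

def encircle(X, X_p, a, rng):
    """One encircling update: move wolf X relative to prey position X_p."""
    r1 = rng.random(X.shape)            # random vector in [0, 1]
    r2 = rng.random(X.shape)
    A = 2 * a * r1 - a                  # coefficient vector in [-a, a]
    C = 2 * r2                          # coefficient vector in [0, 2]
    D = np.abs(C * X_p - X)             # distance term
    return X_p - A * D                  # new position X(t+1)

# Example with placeholder positions and control parameter
rng = np.random.default_rng(0)
X_new = encircle(np.array([0.5, -1.0]), np.array([1.0, 1.0]), a=1.5, rng=rng)
```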
(2) Hunting: In the hunting phase, grey wolves can detect potential prey locations (optimal solutions), with the α, β, and δ wolves guiding the search. Since the solution space often has unknown characteristics, the exact position of the optimal solution may be unclear. To mimic the wolves’ search behavior, it is presumed that the α, β, and δ wolves excel at locating potential prey. Thus, in each iteration, the top three wolves (α, β, δ) are preserved, and the positions of the other search agents, denoted ω, are updated according to the locations of these leading wolves. Here, ω denotes the position of an ordinary search agent in the Grey Wolf Optimization algorithm. This behavior can be mathematically represented as follows:
$$
\begin{aligned}
D_\alpha &= \lvert C_1 \cdot X_\alpha - X \rvert, \quad
D_\beta = \lvert C_2 \cdot X_\beta - X \rvert, \quad
D_\delta = \lvert C_3 \cdot X_\delta - X \rvert \\
X_1 &= X_\alpha - A_1 \cdot D_\alpha, \quad
X_2 = X_\beta - A_2 \cdot D_\beta, \quad
X_3 = X_\delta - A_3 \cdot D_\delta \\
X(t+1) &= \frac{X_1 + X_2 + X_3}{3}
\end{aligned}
$$
$X_\alpha$, $X_\beta$, and $X_\delta$ describe the position vectors of the α, β, and δ wolves, respectively, within the existing group; $X$ denotes the position vector of a grey wolf; $D_\alpha$, $D_\beta$, and $D_\delta$ represent the distances between the current candidate wolf and the top three wolves.
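A minimal sketch of this leader-guided update follows, assuming the three leader positions and the control parameter a are given; each candidate move uses its own random vectors, as in the formulas above.

```python
import numpy as np

def hunt_update(X, X_alpha, X_beta, X_delta, a, rng):
    """Update a wolf's position from the three leader wolves."""
    candidates = []
    for leader in (X_alpha, X_beta, X_delta):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        A = 2 * a * r1 - a
        C = 2 * r2
        D = np.abs(C * leader - X)       # distance to this leader
        candidates.append(leader - A * D)
    return sum(candidates) / 3.0         # X(t+1) = (X1 + X2 + X3) / 3
```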
(3) Attacking prey: When designing the predator–prey attack model, decreasing $a$ causes $A$ to vary, as defined by the encirclement formula. In essence, $A$ represents a random vector within the interval $[-a, a]$, and $a$ diminishes progressively with every iteration. When $A$ falls within $[-1, 1]$, the search agent’s subsequent position may lie anywhere between the current grey wolf and the prey.
3.2.1. Key Steps in the Grey Wolf Optimizer Algorithm
The position of each search agent is updated by a mechanism inspired by the hunting behavior of grey wolves. The update rule is based on the current position of the agent and its distance from the three best agents (α, β, and δ). This ensures that the search agents converge towards the optimal solution while also exploring the search space. The position update is mathematically formulated as:
$$
\begin{aligned}
D &= \lvert C \cdot X_p(t) - X(t) \rvert \\
X(t+1) &= X_p(t) - A \cdot D
\end{aligned}
$$
Here, $X(t)$ represents the current position of the agent, and $X_p(t)$ is the position of the best search agent (the α wolf). The parameters $A$ and $C$ are coefficient vectors that guide the agent’s movement towards the optimal solution.
The coefficients $a$, $A$, and $C$ are updated during each iteration. The parameter $a$ decreases linearly, shifting the algorithm from exploration (initial stages) to exploitation (later stages). The vectors $A$ and $C$ are computed using random values, ensuring that the search agents move towards the prey (optimal solution) while maintaining randomness to prevent stagnation in local optima. Specifically, $A$ is updated as:
$$
A = 2a \cdot r_1 - a, \qquad C = 2 \cdot r_2
$$
where $r_1$ and $r_2$ are random vectors in $[0, 1]$.
Each search agent’s fitness is evaluated at every iteration, determining how well it performs according to the problem’s objective function. The three best solutions, $X_\alpha$, $X_\beta$, and $X_\delta$, are then identified. The α wolf represents the global best solution, while the β and δ wolves represent the second and third best solutions, respectively. The positions of the α, β, and δ wolves guide the other agents and ensure that the search is directed towards the optimal solution.
The positions of the α, β, and δ wolves are updated regularly, and if any agent surpasses one of the current best solutions, it replaces the corresponding wolf in the hierarchy. This process ensures that the best solutions remain the reference for the other agents. Over time, the algorithm moves closer to the global optimum by updating these best positions, leading to an effective search for optimal solutions in complex problem spaces.
The computational complexity of the IGWO algorithm is primarily determined by the number of search agents $N$, the number of iterations $T$, and the dimensionality of the problem $D$. In each iteration, the fitness of all search agents is evaluated ($O(N)$ evaluations), and their positions are updated based on the three leader wolves ($O(N \times D)$ operations). Therefore, the overall complexity of the algorithm is approximately $O(T \times N \times D)$. Since fitness evaluation is model-based (e.g., training a neural network), its cost dominates the computation, and reducing training time or using early stopping can significantly improve efficiency.
3.2.2. Improved Grey Wolf Optimization
The standard Grey Wolf Optimization algorithm initializes the wolf population with an ordinary pseudo-random number generator. Such generators may suffer from short periods and poor randomness, producing an initial population of low quality that may not cover the entire search space well, which in turn weakens the global search capability of the algorithm. To solve this problem, we propose a Mersenne Twister-based Grey Wolf Optimization built on the original GWO algorithm: the Mersenne Twister generates random numbers with a long period and good statistical randomness, improving the diversity and coverage of the initial population and effectively addressing this weakness of the standard algorithm.
The Mersenne Twister algorithm is used to initialize the grey wolf population in the GWO algorithm. This initialization step is crucial, as it generates a set of initial solutions for the population, which serves as the starting point for the optimization process. By employing the Mersenne Twister, we ensure that the initial solutions are distributed uniformly and randomly, which enhances the diversity of the population. A diverse population allows the algorithm to explore the solution space more effectively, thereby improving its overall performance and convergence speed.
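As a sketch of this initialization step, the function below draws the initial population from NumPy’s MT19937 (Mersenne Twister) generator over a box-constrained search space; the bounds, seed, and population size in the example are placeholders.

```python
import numpy as np

def init_population_mt(n_wolves, lower, upper, seed=42):
    """Initialize the grey wolf population with the Mersenne Twister (MT19937)."""
    rng = np.random.Generator(np.random.MT19937(seed))   # long-period MT19937 stream
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    # Spread n_wolves positions uniformly across the box [lower, upper]
    return lower + rng.random((n_wolves, lower.size)) * (upper - lower)

# Example: 20 wolves in a 5-dimensional search space (placeholder bounds)
population = init_population_mt(20, lower=[0] * 5, upper=[1] * 5)
```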
The pseudo-code for the algorithm begins by initializing the grey wolf population using the Mersenne Twister. Additionally, the parameters $a$, $A$, and $C$ are initialized to control the search process. The fitness of each search agent is then calculated, identifying the positions of the best, second-best, and third-best search agents, denoted as $X_\alpha$, $X_\beta$, and $X_\delta$, respectively.
The algorithm then enters a loop that continues until the maximum number of iterations is reached. In each iteration, the positions of all search agents are updated based on their current position and their proximity to the best positions found so far. The parameters $a$, $A$, and $C$ are updated, and the fitness of all agents is recalculated. The positions of the best agents $X_\alpha$, $X_\beta$, and $X_\delta$ are updated accordingly. The loop continues until the stopping criterion, defined by the maximum number of iterations, is satisfied.
Finally, the best solution, represented by $X_\alpha$, is returned as the optimal solution found by the algorithm. This process allows the algorithm to balance exploration and exploitation, effectively converging toward an optimal solution. The pseudo-code of Algorithm 1 is as follows:
Algorithm 1: The Pseudo-Code of the IGWO Algorithm
  Initialize the grey wolf population X_i (i = 1, 2, …, n) with the Mersenne Twister
  Initialize a, A, and C
  Calculate the fitness of each search agent
  X_α = the best search agent
  X_β = the second best search agent
  X_δ = the third best search agent
  While (t < maximum number of iterations)
      for each search agent
          Update the position of the current search agent
      end for
      Update a, A, and C
      Calculate the fitness of all search agents
      Update X_α, X_β, X_δ
      t = t + 1
  end while
  return X_α
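To make Algorithm 1 concrete, the following self-contained Python sketch consolidates the earlier fragments into one loop under illustrative assumptions: a simple sphere function stands in for the model-based fitness, boundary clipping is added to keep agents inside the search space, and all parameter values are placeholders.

```python
import numpy as np

def igwo(fitness, lower, upper, n_wolves=20, max_iter=100, seed=42):
    """Minimal IGWO sketch: MT19937 initialization plus standard GWO position updates."""
    rng = np.random.Generator(np.random.MT19937(seed))          # Mersenne Twister stream
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size
    X = lower + rng.random((n_wolves, dim)) * (upper - lower)    # initial population

    # Persistent leaders: (fitness, position) for alpha, beta, delta
    leaders = [(np.inf, None)] * 3
    for t in range(max_iter):
        for i in range(n_wolves):
            score = fitness(X[i])
            # Replace a leader whenever an agent surpasses it
            if score < leaders[0][0]:
                leaders = [(score, X[i].copy()), leaders[0], leaders[1]]
            elif score < leaders[1][0]:
                leaders = [leaders[0], (score, X[i].copy()), leaders[1]]
            elif score < leaders[2][0]:
                leaders = [leaders[0], leaders[1], (score, X[i].copy())]

        a = 2.0 * (1.0 - t / max_iter)                           # a decreases linearly from 2 toward 0
        for i in range(n_wolves):
            moves = []
            for _, leader in leaders:
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                D = np.abs(C * leader - X[i])
                moves.append(leader - A * D)
            X[i] = np.clip(sum(moves) / 3.0, lower, upper)       # X(t+1), kept in bounds

    return leaders[0][1], leaders[0][0]                          # X_alpha and its fitness

# Example with a placeholder objective (sphere function) in a 4-dimensional space
best_pos, best_val = igwo(lambda x: float(np.sum(x ** 2)), lower=[-5] * 4, upper=[5] * 4)
```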
The overall workflow of the IGWO algorithm is illustrated in
Figure 2.
3.3. The Multi-Head Attention Mechanism
The Multi-Head Attention mechanism in the Transformer parallelizes the self-attention process by mapping the input queries, keys, and values into multiple subspaces. Self-attention is computed for each subspace, and the outputs are combined and linearly transformed to produce the final result. This approach allows the model to capture diverse relationships within the input sequence, enhancing its representational capacity.
As shown in Figure 3, the multi-head attention (MA) mechanism starts by applying linear transformations to the inputs V, Q, and K and then computing scaled dot-product attention for each head separately. Although the heads use distinct parameter matrices W for the linear transformations of Q, K, and V, the operation is collectively referred to as multi-head attention. Finally, the outputs of all heads are concatenated and linearly transformed to produce the final output of the MA mechanism.
The MA mechanism involves executing multiple self-attention operations on the initial input sequences $Q$, $K$, and $V$. Afterward, it concatenates the outcomes of each self-attention head and applies a single linear transformation to derive the final output. Specifically, its calculation formula is:
$$
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}
\end{aligned}
$$
where $d_k$ is the dimension of the keys, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the projection matrices of head $i$, and $W^{O}$ is the output projection matrix.
The multi-head attention mechanism enables the model to focus on information from various subspaces at different positions, making it more effective than single self-attention.
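The following NumPy sketch mirrors these formulas with randomly initialized projection matrices; the sequence length, model dimension, and head count are placeholder values rather than the configuration used in IGWO-MALSTM.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity of queries and keys
    return softmax(scores) @ V                    # weighted sum of the values

def multi_head_attention(Q, K, V, n_heads, d_model, rng):
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):                      # each head has its own projections
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.standard_normal((n_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1, ..., head_h) W^O

# Example: self-attention over a sequence of 10 steps with d_model = 64 (placeholders)
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))
out = multi_head_attention(x, x, x, n_heads=4, d_model=64, rng=rng)
```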
3.4. The IGWO-MALSTM Model
To address the high cost and time-consuming nature of manually selecting hyperparameters for LSTM models, as well as their poor long-term forecasting capabilities, this paper improves the initialization process of the original GWO algorithm and proposes the IGWO algorithm to optimize the LSTM hyperparameters. This minimizes manual intervention and subjectivity in hyperparameter selection and enhances the model’s predictive capability. Additionally, a multi-head attention mechanism is introduced to adjust the weight allocation of gradients, better controlling their magnitude. Attention weights are incorporated at each time step of the LSTM to measure the correlation between the input and previous hidden states. These attention weights, calculated from the similarity between the input and hidden states, enable a more stable gradient flow during training. By introducing the attention mechanism, the gradient propagation path can be controlled more precisely, mitigating gradient explosion issues and enhancing the LSTM’s long-term forecasting capabilities.
In summary, integrating the multi-head attention mechanism into LSTM and optimizing hyperparameters with IGWO addresses the gradient explosion issue, enhancing the model’s stability and predictive performance. This approach offers a novel method for improving LSTM networks’ ability to handle complex sequential data, with broad application potential.
Structure of IGWO-MALSTM
The architecture of the IGWO-MALSTM model primarily comprises an input layer, three LSTM layers, an attention layer, a dense layer, and an output layer. The key innovation of this model lies in the use of the IGWO algorithm to optimize critical hyperparameters, including lookback period, number of neurons, dropout rate, batch size, and epochs.
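To illustrate how a wolf’s position vector is interpreted as a set of LSTM hyperparameters, the sketch below decodes one continuous position into the five hyperparameters listed above; the bounds are assumed for illustration and are not the ranges used in our experiments.

```python
import numpy as np

# Illustrative search-space bounds (assumed, not the paper's exact ranges):
# lookback, neurons, dropout, batch size, epochs
LOWER = np.array([5.0,  16.0,  0.0, 16.0,  10.0])
UPPER = np.array([60.0, 256.0, 0.5, 128.0, 200.0])

def decode_wolf(position):
    """Map a continuous IGWO position vector to concrete LSTM hyperparameters."""
    p = np.clip(position, LOWER, UPPER)
    return {
        "lookback":   int(round(p[0])),
        "neurons":    int(round(p[1])),
        "dropout":    float(p[2]),
        "batch_size": int(round(p[3])),
        "epochs":     int(round(p[4])),
    }

# Example: decode one wolf sampled uniformly inside the bounds
rng = np.random.default_rng(1)
wolf = LOWER + rng.random(5) * (UPPER - LOWER)
print(decode_wolf(wolf))
```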
The IGWO algorithm serves as a crucial mechanism for optimizing neuron configurations within the LSTM and attention layers, including adjustments to neuron counts, structural hierarchy, and parameters. It employs an adaptive iterative process to select the optimal neuron configurations and parameters, thereby enhancing the network’s overall performance. Initially, data are processed through the LSTM layers, which handle time series by integrating forward and backward information. This functionality enables the model to capture long-range dependencies within the sequence. The output of the LSTM layers, consisting of contextual information for each point in the time series, is then passed to the subsequent multi-head attention layer.
In the multi-head attention layer, the model independently processes the LSTM layers’ output using multiple attention heads, each focusing on distinct feature subsets. This mechanism allows the model to discern the internal structure of the data across different representational subspaces, providing a more comprehensive understanding of the data. Through this process, the model extracts information from various perspectives, enhancing the representation of critical features and ultimately producing more accurate and relevant results.
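A minimal Keras sketch of the layer stack described in this subsection (input, three stacked LSTM layers, a multi-head attention layer, a dense layer, and an output layer) is given below; the unidirectional LSTM layers, layer sizes, head count, and pooling step are illustrative assumptions, since the actual hyperparameters are selected by IGWO.

```python
import tensorflow as tf

def build_malstm(lookback, n_features, neurons=64, n_heads=4, dropout=0.2):
    """Sketch of the MALSTM stack; the hyperparameter values here are placeholders."""
    inputs = tf.keras.Input(shape=(lookback, n_features))
    x = inputs
    for _ in range(3):                                     # three stacked LSTM layers
        x = tf.keras.layers.LSTM(neurons, return_sequences=True)(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    # Multi-head self-attention over the LSTM outputs
    x = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=neurons // n_heads)(x, x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)        # collapse the time dimension
    x = tf.keras.layers.Dense(32, activation="relu")(x)    # dense layer
    outputs = tf.keras.layers.Dense(1)(x)                  # single-step forecast output
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_malstm(lookback=30, n_features=1)
```

In the full pipeline, the IGWO fitness function would build such a model from a decoded wolf position, train it, and return the validation error as that agent’s fitness.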
The model first addresses temporal dependencies through the LSTM layers and then enhances features via the multi-head attention layer, integrating the strengths of both components to improve its ability to handle complex sequential data. The structure of the IGWO-MALSTM model is illustrated in
Figure 4.
Figure 5 presents a simplified framework diagram of the IGWO-MALSTM model, illustrating its data processing workflow.
The IGWO-MALSTM model is constructed using algorithmically optimized parameters to enhance structural clarity and decision-making capacity. This optimization process highlights the rationale behind our parameter choices and model design, minimizing human intervention and improving the clarity of the model’s structure and parameters.