3.1. The LSTM Model
An LSTM consists of three layers: input, hidden, and output. In contrast to standard feedforward neural networks, the neurons in the LSTM hidden layer are connected to one another. This recurrence lets the model relate successive data points and capture intricate patterns and dependencies within the input sequence, but it also raises the challenge of preserving relevant information over long time spans. To tackle this challenge, LSTMs employ three essential gates: a forgetting gate, an input gate, and an output gate. These gates manage the flow of information, ensuring that vital long-term information is retained while irrelevant input data are discarded. Visually, the LSTM model resembles a chain-like structure, as illustrated in
Figure 1, comprising repeating basic units known as memory cells.
The input gate $i_t$ controls the flow of input data into the cell state, deciding what information needs updating and what is essential for accurate predictions. Meanwhile, the forgetting gate $f_t$ filters out unnecessary information from the previous step and updates the cell state accordingly. Lastly, the output gate $o_t$ determines the output value to be transmitted to the next hidden cell. The equations governing the gates of an LSTM are:
$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t * c_{t-1} + i_t * g_t \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
$$
Here, $h_t$ is the hidden state at time $t$ and represents the neuron’s “memory” at that moment. The input at time $t$ is $x_t$, while $h_{t-1}$ is either the hidden state at time $t-1$ or the initial hidden state at time 0. The input, forgetting, unit, and output gates are denoted by $i_t$, $f_t$, $g_t$, and $o_t$, respectively; $c_t$ is the cell state, and $W$, $U$, and $b$ are the corresponding weight matrices and bias vectors. Here, $\sigma$ represents the sigmoid function, and $*$ signifies element-wise vector multiplication.
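To make the cell update concrete, the following is a minimal NumPy sketch of a single LSTM step implementing the equations above; the parameter names (W_f, U_f, b_f, and so on) and the dictionary-based parameter container are illustrative assumptions, not the exact parameterization of our model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell step following the gate equations above.

    params holds weight matrices W_*, U_* and bias vectors b_* for the
    forgetting (f), input (i), unit (g), and output (o) gates (illustrative names).
    """
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])  # forgetting gate
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])  # input gate
    g_t = np.tanh(params["W_g"] @ x_t + params["U_g"] @ h_prev + params["b_g"])  # unit (candidate) gate
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])  # output gate
    c_t = f_t * c_prev + i_t * g_t          # update the cell state
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t
```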
3.2. Grey Wolf Optimization
The GWO algorithm is inspired by the hunting tactics of grey wolves. It approximates solutions for optimization and search problems by mimicking the wolves’ behaviors of encircling, chasing, and attacking their prey. The algorithm simulates the collaborative and competitive behaviors of wolf packs during hunting, iteratively refining candidate solutions. It comprises three core phases: encircling, hunting, and attacking.
(1) Encircling Prey: Grey wolves slowly close in on their prey by encircling it, and this behavior is mathematically modeled as follows:
$$
\begin{aligned}
D &= \lvert C \cdot X_p(t) - X(t) \rvert \\
X(t+1) &= X_p(t) - A \cdot D \\
A &= 2a \cdot r_1 - a \\
C &= 2 \cdot r_2
\end{aligned}
$$
In this context, $X_p$ denotes the prey’s position, while $X$ is the position vector of the grey wolf. The current iteration number is $t$. Coefficient vectors $A$ and $C$ are used, along with random vectors $r_1$ and $r_2$ ranging between 0 and 1. Throughout the iterations, the elements of the vector $a$ linearly decrease from 2 to 0.
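As an illustration, the following minimal NumPy sketch performs one encircling update for a single wolf, assuming a known prey position; the values in the example call are placeholders.

```python
import numpy as np

def encircle(X, X_p, a, rng):
    """One encircling update: move wolf X relative to prey position X_p."""
    r1 = rng.random(X.shape)            # random vector in [0, 1]
    r2 = rng.random(X.shape)
    A = 2 * a * r1 - a                  # coefficient vector in [-a, a]
    C = 2 * r2                          # coefficient vector in [0, 2]
    D = np.abs(C * X_p - X)             # distance term
    return X_p - A * D                  # new position X(t+1)

# Example with placeholder positions and control parameter
rng = np.random.default_rng(0)
X_new = encircle(np.array([0.5, -1.0]), np.array([1.0, 1.0]), a=1.5, rng=rng)
```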
(2) Hunting: In the hunting phase, grey wolves can detect potential prey locations (optimal solutions), with the α, β, and δ wolves guiding the search. Since the solution space often has unknown characteristics, the exact position of the optimal solution may be unclear. To mimic the wolves’ search behavior, it is presumed that the α, β, and δ wolves excel at locating potential prey. Thus, in each iteration, the top three wolves (α, β, δ) are preserved, and the positions of the other search agents, denoted ω, are updated according to the locations of these leading wolves. Here, ω denotes the position of an ordinary search agent in the Grey Wolf Optimization algorithm. This behavior can be mathematically represented as follows:
$$
\begin{aligned}
D_\alpha &= \lvert C_1 \cdot X_\alpha - X \rvert, \quad
D_\beta = \lvert C_2 \cdot X_\beta - X \rvert, \quad
D_\delta = \lvert C_3 \cdot X_\delta - X \rvert \\
X_1 &= X_\alpha - A_1 \cdot D_\alpha, \quad
X_2 = X_\beta - A_2 \cdot D_\beta, \quad
X_3 = X_\delta - A_3 \cdot D_\delta \\
X(t+1) &= \frac{X_1 + X_2 + X_3}{3}
\end{aligned}
$$
$X_\alpha$, $X_\beta$, and $X_\delta$ describe the position vectors of the α, β, and δ wolves, respectively, within the existing group; $X$ denotes the position vector of a grey wolf; $D_\alpha$, $D_\beta$, and $D_\delta$ represent the distances between the current candidate wolf and the top three wolves.
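A minimal sketch of this leader-guided update follows, assuming the three leader positions and the control parameter a are given; each candidate move uses its own random vectors, as in the formulas above.

```python
import numpy as np

def hunt_update(X, X_alpha, X_beta, X_delta, a, rng):
    """Update a wolf's position from the three leader wolves."""
    candidates = []
    for leader in (X_alpha, X_beta, X_delta):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        A = 2 * a * r1 - a
        C = 2 * r2
        D = np.abs(C * leader - X)       # distance to this leader
        candidates.append(leader - A * D)
    return sum(candidates) / 3.0         # X(t+1) = (X1 + X2 + X3) / 3
```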
(3) Attacking prey: When designing the predator–prey attack model, decreasing $a$ causes $A$ to vary, as defined by the encirclement formula. In essence, $A$ represents a random vector within the interval $[-a, a]$, and $a$ diminishes progressively with every iteration. When $A$ falls within $[-1, 1]$, the search agent’s subsequent position may lie anywhere between the current grey wolf and the prey.
3.2.1. Key Steps in the Grey Wolf Optimizer Algorithm
The position of each search agent is updated by a mechanism inspired by the hunting behavior of grey wolves. The update rule is based on the current position of the agent and its distance from the three best agents (α, β, and δ). This ensures that the search agents converge towards the optimal solution while also exploring the search space. The position update is mathematically formulated as:
$$
\begin{aligned}
D &= \lvert C \cdot X_p(t) - X(t) \rvert \\
X(t+1) &= X_p(t) - A \cdot D
\end{aligned}
$$
Here, $X(t)$ represents the current position of the agent, and $X_p(t)$ is the position of the best search agent (the α wolf). The parameters $A$ and $C$ are coefficient vectors that guide the agent’s movement towards the optimal solution.
The coefficients $a$, $A$, and $C$ are updated during each iteration. The parameter $a$ decreases linearly, shifting the algorithm from exploration (initial stages) to exploitation (later stages). The vectors $A$ and $C$ are computed using random values, ensuring that the search agents move towards the prey (optimal solution) while maintaining randomness to prevent stagnation in local optima. Specifically, $A$ is updated as:
$$
A = 2a \cdot r_1 - a, \qquad C = 2 \cdot r_2
$$
where $r_1$ and $r_2$ are random vectors in $[0, 1]$.
Each search agent’s fitness is evaluated at every iteration, determining how well it performs according to the problem’s objective function. The three best solutions, $X_\alpha$, $X_\beta$, and $X_\delta$, are then identified. The α wolf represents the global best solution, while the β and δ wolves represent the second and third best solutions, respectively. The positions of the α, β, and δ wolves guide the other agents and ensure that the search is directed towards the optimal solution.
The positions of the α, β, and δ wolves are updated regularly, and if any agent surpasses one of the current best solutions, it replaces the corresponding wolf in the hierarchy. This process ensures that the best solutions remain the reference for the other agents. Over time, the algorithm moves closer to the global optimum by updating these best positions, leading to an effective search for optimal solutions in complex problem spaces.
The computational complexity of the IGWO algorithm is primarily determined by the number of search agents $N$, the number of iterations $T$, and the dimensionality of the problem $D$. In each iteration, the fitness of all search agents is evaluated ($O(N)$ evaluations), and their positions are updated based on the three leader wolves ($O(N \times D)$ operations). Therefore, the overall complexity of the algorithm is approximately $O(T \times N \times D)$. Since fitness evaluation is model-based (e.g., training a neural network), its cost dominates the computation, and reducing training time or using early stopping can significantly improve efficiency.
3.2.2. Improved Grey Wolf Optimization
The standard Grey Wolf Optimization algorithm initializes the wolf population with an ordinary pseudo-random number generator. Such generators may suffer from short periods and poor randomness, producing an initial population of low quality that may not cover the entire search space well, which in turn weakens the global search capability of the algorithm. To solve this problem, we propose a Mersenne Twister-based Grey Wolf Optimization built on the original GWO algorithm: the Mersenne Twister generates random numbers with a long period and good statistical randomness, improving the diversity and coverage of the initial population and effectively addressing this weakness of the standard algorithm.
The Mersenne Twister algorithm is used to initialize the grey wolf population in the GWO algorithm. This initialization step is crucial, as it generates a set of initial solutions for the population, which serves as the starting point for the optimization process. By employing the Mersenne Twister, we ensure that the initial solutions are distributed uniformly and randomly, which enhances the diversity of the population. A diverse population allows the algorithm to explore the solution space more effectively, thereby improving its overall performance and convergence speed.
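As a sketch of this initialization step, the function below draws the initial population from NumPy’s MT19937 (Mersenne Twister) generator over a box-constrained search space; the bounds, seed, and population size in the example are placeholders.

```python
import numpy as np

def init_population_mt(n_wolves, lower, upper, seed=42):
    """Initialize the grey wolf population with the Mersenne Twister (MT19937)."""
    rng = np.random.Generator(np.random.MT19937(seed))   # long-period MT19937 stream
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    # Spread n_wolves positions uniformly across the box [lower, upper]
    return lower + rng.random((n_wolves, lower.size)) * (upper - lower)

# Example: 20 wolves in a 5-dimensional search space (placeholder bounds)
population = init_population_mt(20, lower=[0] * 5, upper=[1] * 5)
```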
The pseudo-code for the algorithm begins by initializing the grey wolf population using the Mersenne Twister. Additionally, the parameters $a$, $A$, and $C$ are initialized to control the search process. The fitness of each search agent is then calculated, identifying the positions of the best, second-best, and third-best search agents, denoted as $X_\alpha$, $X_\beta$, and $X_\delta$, respectively.
The algorithm then enters a loop that continues until the maximum number of iterations is reached. In each iteration, the positions of all search agents are updated based on their current position and their proximity to the best positions found so far. The parameters $a$, $A$, and $C$ are updated, and the fitness of all agents is recalculated. The positions of the best agents $X_\alpha$, $X_\beta$, and $X_\delta$ are updated accordingly. The loop continues until the stopping criterion, defined by the maximum number of iterations, is satisfied.
Finally, the best solution, represented by $X_\alpha$, is returned as the optimal solution found by the algorithm. This process allows the algorithm to balance exploration and exploitation, effectively converging toward an optimal solution. The pseudo-code of Algorithm 1 is as follows:
Algorithm 1: The Pseudo-Code of the IGWO Algorithm
  Initialize the grey wolf population X_i (i = 1, 2, …, n) with the Mersenne Twister
  Initialize a, A, and C
  Calculate the fitness of each search agent
  X_α = the best search agent
  X_β = the second best search agent
  X_δ = the third best search agent
  While (t < maximum number of iterations)
      for each search agent
          Update the position of the current search agent
      end for
      Update a, A, and C
      Calculate the fitness of all search agents
      Update X_α, X_β, X_δ
      t = t + 1
  end while
  return X_α
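To make Algorithm 1 concrete, the following self-contained Python sketch consolidates the earlier fragments into one loop under illustrative assumptions: a simple sphere function stands in for the model-based fitness, boundary clipping is added to keep agents inside the search space, and all parameter values are placeholders.

```python
import numpy as np

def igwo(fitness, lower, upper, n_wolves=20, max_iter=100, seed=42):
    """Minimal IGWO sketch: MT19937 initialization plus standard GWO position updates."""
    rng = np.random.Generator(np.random.MT19937(seed))          # Mersenne Twister stream
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size
    X = lower + rng.random((n_wolves, dim)) * (upper - lower)    # initial population

    # Persistent leaders: (fitness, position) for alpha, beta, delta
    leaders = [(np.inf, None)] * 3
    for t in range(max_iter):
        for i in range(n_wolves):
            score = fitness(X[i])
            # Replace a leader whenever an agent surpasses it
            if score < leaders[0][0]:
                leaders = [(score, X[i].copy()), leaders[0], leaders[1]]
            elif score < leaders[1][0]:
                leaders = [leaders[0], (score, X[i].copy()), leaders[1]]
            elif score < leaders[2][0]:
                leaders = [leaders[0], leaders[1], (score, X[i].copy())]

        a = 2.0 * (1.0 - t / max_iter)                           # a decreases linearly from 2 toward 0
        for i in range(n_wolves):
            moves = []
            for _, leader in leaders:
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                D = np.abs(C * leader - X[i])
                moves.append(leader - A * D)
            X[i] = np.clip(sum(moves) / 3.0, lower, upper)       # X(t+1), kept in bounds

    return leaders[0][1], leaders[0][0]                          # X_alpha and its fitness

# Example with a placeholder objective (sphere function) in a 4-dimensional space
best_pos, best_val = igwo(lambda x: float(np.sum(x ** 2)), lower=[-5] * 4, upper=[5] * 4)
```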
The overall workflow of the IGWO algorithm is illustrated in
Figure 2.
3.3. The Multi-Head Attention Mechanism
The Multi-Head Attention mechanism in the Transformer parallelizes the self-attention process by mapping the input queries, keys, and values into multiple subspaces. Self-attention is computed for each subspace, and the outputs are combined and linearly transformed to produce the final result. This approach allows the model to capture diverse relationships within the input sequence, enhancing its representational capacity.
As shown in Figure 3, the multi-head attention (MA) mechanism starts by applying linear transformations to the inputs V, Q, and K and then computing scaled dot-product attention for each head separately. Although the heads use distinct parameter matrices W for the linear transformations of Q, K, and V, the operation is collectively referred to as multi-head attention. Finally, the outputs of all heads are concatenated and linearly transformed to produce the final output of the MA mechanism.
The MA mechanism involves executing multiple self-attention operations on the initial input sequences $Q$, $K$, and $V$. Afterward, it concatenates the outcomes of each self-attention head and applies a single linear transformation to derive the final output. Specifically, its calculation formula is:
$$
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}
\end{aligned}
$$
where $d_k$ is the dimension of the keys, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the projection matrices of head $i$, and $W^{O}$ is the output projection matrix.
The multi-head attention mechanism enables the model to focus on information from various subspaces at different positions, making it more effective than single self-attention.
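The following NumPy sketch mirrors these formulas with randomly initialized projection matrices; the sequence length, model dimension, and head count are placeholder values rather than the configuration used in IGWO-MALSTM.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity of queries and keys
    return softmax(scores) @ V                    # weighted sum of the values

def multi_head_attention(Q, K, V, n_heads, d_model, rng):
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):                      # each head has its own projections
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.standard_normal((n_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1, ..., head_h) W^O

# Example: self-attention over a sequence of 10 steps with d_model = 64 (placeholders)
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))
out = multi_head_attention(x, x, x, n_heads=4, d_model=64, rng=rng)
```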
3.4. The IGWO-MALSTM Model
To address the high cost and time-consuming nature of manually selecting hyperparameters for LSTM models, as well as their poor long-term forecasting capabilities, this paper improves the initialization process of the original GWO algorithm and proposes the IGWO algorithm to optimize the LSTM hyperparameters. This minimizes manual intervention and subjectivity in hyperparameter selection and enhances the model’s predictive capability. Additionally, a multi-head attention mechanism is introduced to adjust the weight allocation of gradients, better controlling their magnitude. Attention weights are incorporated at each time step of the LSTM to measure the correlation between the input and previous hidden states. These attention weights, calculated from the similarity between the input and hidden states, enable a more stable gradient flow during training. By introducing the attention mechanism, the gradient propagation path can be controlled more precisely, mitigating gradient explosion issues and enhancing the LSTM’s long-term forecasting capabilities.
In summary, integrating the multi-head attention mechanism into LSTM and optimizing hyperparameters with IGWO addresses the gradient explosion issue, enhancing the model’s stability and predictive performance. This approach offers a novel method for improving LSTM networks’ ability to handle complex sequential data, with broad application potential.
Structure of IGWO-MALSTM
The architecture of the IGWO-MALSTM model primarily comprises an input layer, three LSTM layers, an attention layer, a dense layer, and an output layer. The key innovation of this model lies in the use of the IGWO algorithm to optimize critical hyperparameters, including lookback period, number of neurons, dropout rate, batch size, and epochs.
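To illustrate how a wolf’s position vector is interpreted as a set of LSTM hyperparameters, the sketch below decodes one continuous position into the five hyperparameters listed above; the bounds are assumed for illustration and are not the ranges used in our experiments.

```python
import numpy as np

# Illustrative search-space bounds (assumed, not the paper's exact ranges):
# lookback, neurons, dropout, batch size, epochs
LOWER = np.array([5.0,  16.0,  0.0, 16.0,  10.0])
UPPER = np.array([60.0, 256.0, 0.5, 128.0, 200.0])

def decode_wolf(position):
    """Map a continuous IGWO position vector to concrete LSTM hyperparameters."""
    p = np.clip(position, LOWER, UPPER)
    return {
        "lookback":   int(round(p[0])),
        "neurons":    int(round(p[1])),
        "dropout":    float(p[2]),
        "batch_size": int(round(p[3])),
        "epochs":     int(round(p[4])),
    }

# Example: decode one wolf sampled uniformly inside the bounds
rng = np.random.default_rng(1)
wolf = LOWER + rng.random(5) * (UPPER - LOWER)
print(decode_wolf(wolf))
```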
The IGWO algorithm serves as a crucial mechanism for optimizing neuron configurations within the LSTM and attention layers, including adjustments to neuron counts, structural hierarchy, and parameters. It employs an adaptive iterative process to select the optimal neuron configurations and parameters, thereby enhancing the network’s overall performance. Initially, data are processed through the LSTM layers, which handle time series by integrating forward and backward information. This functionality enables the model to capture long-range dependencies within the sequence. The output of the LSTM layers, consisting of contextual information for each point in the time series, is then passed to the subsequent multi-head attention layer.
In the multi-head attention layer, the model independently processes the LSTM layers’ output using multiple attention heads, each focusing on distinct feature subsets. This mechanism allows the model to discern the internal structure of the data across different representational subspaces, providing a more comprehensive understanding of the data. Through this process, the model extracts information from various perspectives, enhancing the representation of critical features and ultimately producing more accurate and relevant results.
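A minimal Keras sketch of the layer stack described in this subsection (input, three stacked LSTM layers, a multi-head attention layer, a dense layer, and an output layer) is given below; the unidirectional LSTM layers, layer sizes, head count, and pooling step are illustrative assumptions, since the actual hyperparameters are selected by IGWO.

```python
import tensorflow as tf

def build_malstm(lookback, n_features, neurons=64, n_heads=4, dropout=0.2):
    """Sketch of the MALSTM stack; the hyperparameter values here are placeholders."""
    inputs = tf.keras.Input(shape=(lookback, n_features))
    x = inputs
    for _ in range(3):                                     # three stacked LSTM layers
        x = tf.keras.layers.LSTM(neurons, return_sequences=True)(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    # Multi-head self-attention over the LSTM outputs
    x = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=neurons // n_heads)(x, x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)        # collapse the time dimension
    x = tf.keras.layers.Dense(32, activation="relu")(x)    # dense layer
    outputs = tf.keras.layers.Dense(1)(x)                  # single-step forecast output
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_malstm(lookback=30, n_features=1)
```

In the full pipeline, the IGWO fitness function would build such a model from a decoded wolf position, train it, and return the validation error as that agent’s fitness.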
The model first addresses temporal dependencies through the LSTM layers and then enhances features via the multi-head attention layer, integrating the strengths of both components to improve its ability to handle complex sequential data. The structure of the IGWO-MALSTM model is illustrated in
Figure 4.
Figure 5 presents a simplified framework diagram of the IGWO-MALSTM model, illustrating its data processing workflow.
The IGWO-MALSTM model is constructed using algorithmically optimized parameters to enhance structural clarity and decision-making capacity. This optimization process highlights the rationale behind our parameter choices and model design, minimizing human intervention and improving the clarity of the model’s structure and parameters.