1. Introduction
Crude oil is an important global commodity and industrial foundation. Its price is a closely watched international market indicator, reflecting changes in the macroeconomy and the energy system [
1,
2]. Brent crude oil is the core of global pricing, reflecting changes in supply and demand and related to major benchmark prices [
1,
3]. However, the price fluctuations of Brent crude oil are not only caused by traditional supply and demand relations. Many factors contribute to price formation, such as geopolitical risks [
4], policy or financial changes [
5,
6], energy, climate impacts [
7], etc. They interact to generate complex market dynamics, making prices fluctuate and changing risks. Studies show that crude oil price series are non-stationary, have high-frequency fluctuations [
8], and may also have nonlinear and chaotic characteristics [
9].
Recent studies have also shown that entropy-based signal decomposition methods can enhance feature representation by explicitly minimizing entropy in the preprocessing stage, thus strengthening the link between information theory and financial forecasting methods [
10,
11,
12,
13,
14,
15]. In addition, for financial price series such as crude oil prices, the random-walk or no-change forecast provides a canonical benchmark that is often difficult to outperform in out-of-sample prediction tasks. Recent studies emphasize that the choice and correct specification of this benchmark are critical for interpreting forecasting performance [
16,
17].
Early studies on oil price forecasting relied on qualitative assessments and traditional econometric models, such as ARIMA, VAR, and GARCH [
18,
19,
20]. These models remain useful for describing statistical regularities, volatility dynamics, and economic interrelationships in oil markets. For example, Cevik and Gillman (2025) used the local projection approach to examine the response of the oil market to macroeconomic policy shocks [
21]. However, their value for economic interpretation should be distinguished from their predictive performance in out-of-sample forecasting settings.
Traditional econometric models typically depend on relatively strict parametric assumptions and linear statistical structures, which may become restrictive when oil markets experience structural breaks, parameter instability, or regime changes [
22]. Recent studies further suggest that oil price formation exhibits pronounced nonlinear and time-varying characteristics, implying that more flexible models may be needed to capture its evolving dynamics [
23]. At the same time, recent reassessments of oil price forecasting have shown that some previously reported forecasting gains are sensitive to benchmark specification and evaluation design, especially when the random-walk or no-change benchmark is omitted or inadequately defined [
16,
17,
24]. This issue becomes even more relevant in high-dimensional environments characterized by structural change, policy adjustment, and regulatory shifts, where forecast robustness is critical [
25]. Against this background, deep learning has introduced data-driven alternatives such as LSTM and GRU [
26,
27], which have been increasingly applied in energy price forecasting [
28]. Nevertheless, when the original price series contains substantial noise and non-stationarity, directly feeding such sequences into deep neural networks may lead to unstable training and degraded forecasting performance.
To solve the problem, the decomposition-integration strategy has attracted attention in recent years [
29,
30]. In this framework, the original time series is first decomposed into components with different frequency characteristics, and then each component is modeled separately. This method helps the model capture the multi-scale dynamics of complex financial time series. Among the decomposition techniques, Variational Mode Decomposition (VMD) shows great potential in crude oil price prediction due to its ability to adaptively separate signals in the frequency domain [
10,
31].
Conventional trial-and-error procedures for parameter selection are inefficient and difficult to apply in large-scale forecasting environments. Consequently, meta-heuristic optimization algorithms have increasingly been introduced to determine optimal parameter configurations automatically [
32]. Swarm intelligence methods such as Particle Swarm Optimization (PSO) and the Grey Wolf Optimizer (GWO) have been used to jointly optimize the parameters
, reducing subjective bias in parameter selection. Nevertheless, these algorithms may still suffer from premature convergence and an imbalance between exploration and exploitation, which can hinder both global search capability and convergence efficiency [
33,
34,
35].
At the subsequent forecasting stage, model design has gradually evolved from single LSTM or BiGRU structures toward more sophisticated hybrid frameworks incorporating attention mechanisms [
36]. However, Transformer-based architectures, which rely on standard self-attention mechanisms, often encounter severe computational challenges in long-sequence time series forecasting tasks. Specifically, the computational complexity of self-attention grows quadratically with sequence length, i.e.,
, which limits scalability when dealing with long historical sequences [
37,
38,
39]. To alleviate these limitations, recent studies have explored lightweight combinations of convolutional neural networks (CNNs) and attention mechanisms, aiming to enhance local feature extraction while reducing computational overhead [
40,
41]. Within this context, the CNN autoencoder-based score attention (CASA) mechanism provides an efficient alternative. By leveraging lightweight CNN autoencoders to reconstruct latent representations of local temporal segments and assigning importance scores, CASA avoids global dot-product operations, reducing both time and memory complexity to linear order while improving inference efficiency [
42].
Motivated by the above considerations, and aiming to jointly address the parameter sensitivity of VMD and the computational inefficiency of conventional attention mechanisms, this paper proposes a novel hybrid forecasting framework, termed ALA-VMD-CASA.
The main contributions of this study can be summarized in two aspects. First, we developed an adaptive optimization mechanism based on the ALA to determine the key parameters of VMD, thereby improving the stability and effectiveness of multi-scale signal decomposition while linking decomposition quality with an envelope entropy minimization objective. Second, we integrate the optimized VMD decomposition with a lightweight CASA to construct an efficient hybrid forecasting framework capable of capturing multi-scale temporal dependencies while maintaining linear computational complexity, which significantly enhances forecasting performance and robustness for crude oil price series.
The structure of the paper is as follows.
Section 2 elaborates on the method, including VMD entropy-driven optimization based on ALA, the VMD framework, and the CASA mechanism.
Section 3 describes the experimental design: covering data, models, evaluation indicators, results, ablation experiments, and robustness analysis (such as window rolling prediction, multi-step prediction, random initialization test).
Section 4 conducts a summary, re-emphasizes the ability of the framework to reduce market entropy, and explores potential extension directions.
2. Methodology
This decomposition constructs the multi-scale structure in the data, and can analyze different time dynamics separately. An adaptive entropy-driven mechanism is set up to ensure that the decomposition can accurately reflect the underlying structure of the series and determine the settings for the decomposition. Based on the entropy-selected parameters, this method enhances the decomposition and characterization of complex and uncertain market dynamics. After decomposition, feature learning is further enhanced by the CASA mechanism based on the score-based attention structure. This model enhances the characterization of time patterns, improves the ability to capture dependencies from different time scales, and reduces the computational complexity. The obvious features after decomposition and the extracted feature representations are added to the prediction model, and then the prediction results of crude oil prices are generated.
2.1. Variational Mode Decomposition
VMD was proposed by Dragomiretskiy and Zosso [
10]. It is an adaptive signal processing technique used to decompose signals into band-limited modes. For early decomposition methods such as EMD [
43,
44] and its ensemble empirical mode decomposition [
45], VMD constructs the decomposition problem in a well-defined variable optimization framework. This formulation alleviates the problems of mode mixing and weak theoretical basis encountered by EMD-type methods. Unlike Empirical Mode Decomposition (EMD), VMD aims to decompose an input signal
into
K discrete sub-signals
, where each mode is compactly supported around its corresponding center frequency with a limited bandwidth.
To estimate the bandwidth of each mode, VMD constructs a one-sided frequency spectrum via the Hilbert transform and subsequently shifts the spectrum of each mode to the baseband. The signal decomposition procedure is therefore formulated as a constrained variational optimization problem, which seeks to minimize the sum of the bandwidths of all modes subject to the reconstruction constraint that the superposition of all modes equals the original signal. The resulting optimization problem can be written as follows:
In this formulation,
denotes the set of decomposed modes, while
corresponds to their associated center frequencies. The symbol
represents the Dirac delta function, and the operator * indicates convolution.
To address the constrained optimization problem described above, a quadratic penalty term
together with a Lagrange multiplier
is introduced. This transformation converts Equation (
1) into an augmented Lagrangian form without explicit constraints. The resulting saddle-point solution can then be efficiently obtained using the Alternating Direction Method of Multipliers (ADMM), which iteratively estimates the optimal set of variational modes.
From a functional perspective, the VMD process maps the input sequence to the set of decomposed components. For a discrete-time input sequence
, the resulting set of modes is largely determined by two parameters: the number of modes
K and the penalty coefficient
. This relationship can be written as:
which indicates that the characteristics of the resulting mode set are governed directly by the parameter pair
.
Figure 1 presents the process of VMD decomposition. The original time series is decomposed into intrinsic mode functions (IMFs), and each IMF corresponds to a different frequency range. For example, IMF0 is mainly the low-frequency trend, IMF1 is the medium-frequency oscillation, and IMF2 is the high-frequency variable/noise. The result is that VMD adaptively separates each frequency band signal component in a non-recursive way.
From an economic perspective, the extracted IMFs can reflect the situation of crude oil price dynamics at different time scales. The low-frequency component IMF0 mainly reflects the long-term structural changes in oil prices. This change is related to factors driven by funds such as global supply and demand, changes in energy structure, macroeconomic cycles, inflation expectations, and U.S. dollar fluctuations. Over time, these factors change, shape the trend of oil prices, and also provide information for medium- and long-term forecasting.
2.2. Artificial Lemming Algorithm
Selecting appropriate values for the VMD parameters
is essential for obtaining reliable decomposition results. This task can be formulated as a nonlinear optimization problem in which parameter choices directly influence the quality of the extracted signal components. To address this issue in an automated manner, the present study employs the Artificial Lemming Algorithm (ALA), a recently proposed bio-inspired meta-heuristic introduced by Xiao et al. [
46]. The algorithm is inspired by the collective foraging and migration behavior observed in lemming populations and provides an adaptive search strategy suitable for complex optimization problems.
The main mechanisms underlying ALA are outlined below.
- (1)
Energy control mechanism.
In the Artificial Lemming Algorithm, the behavioral dynamics of the population are regulated through an energy factor
E. This factor gradually decreases as the iteration process proceeds and determines the transition between the global search stage and the local refinement stage of the algorithm. The energy level at iteration
t is defined as:
where
t denotes the current iteration number and
represents the maximum number of iterations allowed in the optimization process.
- (2)
Exploration phase ().
When the energy level remains relatively high, the algorithm emphasizes exploration of the search space. In this stage, individual lemmings perform large-step movements in order to explore new regions and reduce the risk of being trapped in local optima. The position of the
i-th lemming is updated according to
where
represents a randomly selected position within the search space, and
is a control parameter that regulates the magnitude of the movement.
- (3)
Exploitation phase ().
When the algorithm enters the exploitation stage, lemmings conduct intensive local searches in the vicinity of the currently identified best solution
. The corresponding position update rule is given by
where
denotes a convergence control parameter.
- (4)
Parameter mapping for VMD optimization.
Within the proposed framework, the position vector of each lemming directly corresponds to a candidate VMD parameter set. Specifically, let . Through iterative updates of , ALA minimizes a predefined fitness function, thereby identifying the globally optimal parameter combination .
Through the above mechanisms, ALA is capable of efficiently converging toward the global optimum in a complex parameter space.
The complete ALA-based VMD optimization procedure is summarized in Algorithm 1.
| Algorithm 1 ALA-based Adaptive Optimization of VMD Parameters |
- 1:
Input: Original time series , population size , maximum iteration number , search bounds of VMD parameters and . - 2:
Output: Optimal VMD parameter set . - 3:
Randomly initialize the lemming population - 4:
For each lemming individual , perform VMD on and evaluate the fitness value based on envelope entropy - 5:
Determine the current global best solution - 6:
Set iteration counter - 7:
while do - 8:
Compute the energy factor E according to Equation ( 3) - 9:
for each lemming individual do - 10:
if then - 11:
(Exploration phase) - 12:
Randomly select a position from the current swarm - 13:
Update using the exploration rule (Equation ( 4)) - 14:
else - 15:
(Exploitation phase) - 16:
Update using the exploitation rule (Equation ( 5)) - 17:
end if - 18:
Apply boundary constraints to ensure , , and enforce as an integer - 19:
end for - 20:
Recalculate fitness values and update - 21:
Set - 22:
end while - 23:
Return the optimal VMD parameter set
|
2.3. CNN Autoencoder Score Attention
Although Variational Mode Decomposition (VMD) alleviates the non-stationarity of the original time series, effective forecasting of each decomposed mode
still requires accurate modeling of local morphological patterns and latent global dependencies. To address this issue, this study introduces the CASA mechanism. Originally proposed by Lee et al. in 2025, CASA replaces the quadratic-complexity dot-product attention in standard Transformers with a CNN-based linear-complexity attention mechanism, thereby significantly reducing computational overhead while preserving representational capacity [
42].
The CASA module consists of two parallel processing paths for the input embedding sequence
X. The overall architecture of the CASA block is illustrated in
Figure 2.
- (1)
Score generation path.
First, CASA constructs an attention score generation path using a one-dimensional convolutional neural network (1D-CNN) autoencoder to extract local morphological features from the input sequence. The encoder compresses local contextual information, while the decoder reconstructs the latent feature representation. The reconstructed output is then normalized by a Softmax function to produce the attention score vector
, given by:
- (2)
Value generation path.
In the value generation path, the input sequence
X is projected into a value representation
V through a linear embedding layer:
- (3)
Feature fusion stage.
At the feature fusion stage, the final output
O is obtained via element-wise multiplication, whereby the attention scores derived from local morphological features are used to reweight the value representation:
where “⊙” denotes the Hadamard (element-wise) product. The resulting output
O is subsequently fed into a feed-forward network (FFN) to generate the final forecasting results. The CASA block adopts a Transformer-style residual structure with Add and Norm operations to enhance training stability and representation capacity.
As a result, CASA introduces a more appropriate inductive bias for modeling complex and non-stationary energy price series, complementing the multi-scale decomposition achieved by VMD and further improving the overall predictive stability of the proposed framework.
2.4. The Proposed AI-Driven Entropy-Aware Forecasting Framework
This section presents the overall architecture of the proposed ALA-VMD-CASA ensemble forecasting framework. The framework establishes an end-to-end mapping from parameter optimization to final forecasting through four sequential stages. The logical connections among modules and the associated data flow are formalized as follows.
- (1)
ALA-based parameter adaptive optimization.
Let the original Brent crude oil price series be denoted as . The first stage aims to address the optimal selection of the VMD hyperparameter pair .
The population search space of the Artificial Lemming Algorithm (ALA) is defined as , where the state vector of the i-th individual is given by , which directly corresponds to a candidate parameter configuration for VMD.
The objective function is defined as the minimum envelope entropy of the decomposition results. Accordingly, the ALA optimizer iteratively searches for the global optimum by solving the following minimization problem:
where
denotes the VMD operator and
denotes the envelope entropy.
In this process, ALA serves as the outer optimizer that dynamically updates the parameter candidates, while VMD acts as the inner evaluator that feeds back the decomposition quality.
- (2)
Signal adaptive decomposition.
Using the optimal parameter set
obtained in Stage 1, the original non-stationary series
is decomposed via VMD. This operation maps the original signal space into an orthogonal subspace composed of
intrinsic mode functions (IMFs):
The resulting subseries effectively separate high-frequency noise, medium-term fluctuations, and long-term trends embedded in the original data, thereby substantially reducing signal non-stationarity and structural complexity.
- (3)
Parallel CASA forecasting.
For each decomposed component
, an independent CASA-based forecasting channel is constructed. First, a sliding-window strategy is employed to reconstruct the one-dimensional time series into supervised learning samples. Specifically, for the
k-th mode, the input feature matrix at time
t is defined as
where
w denotes the window length.
Subsequently, a nonlinear mapping function is constructed. The
k-th CASA model extracts local morphological features from
through its internal CNN autoencoder and outputs the one-step-ahead forecast:
- (4)
Nonlinear ensemble aggregation and output.
Given that VMD satisfies the completeness constraint of signal reconstruction, i.e.,
, the final Brent crude oil price forecast can be obtained by linearly aggregating the predicted values of all subcomponents:
In summary, the proposed framework ensures optimal signal decomposition through ALA-driven adaptive parameter optimization, achieves effective stabilization of non-stationary price dynamics via VMD, and enhances sensitivity to localized structural patterns through CASA-based forecasting. The ALA-VMD-CASA framework integrates adaptive optimization, multi-scale signal decomposition, and attention-based prediction, and alleviates the situations of parameter sensitivity, cross-scale interaction, and noise interference. This integration enables the model to perform stable prediction, which is applied to analyze crude oil price series with strong fluctuation and non-stationary characteristics.
3. Experiments and Results
3.1. Data Source and Preprocessing
The data source for the empirical analysis is the Federal Reserve Economic Data (FRED) database. The Brent crude oil price series (DCOILBRENTEU) serves as a key global benchmark for international oil markets and is widely used in empirical energy economics and financial forecasting studies, ensuring both data reliability and economic relevance. The data sample consists of daily data from 4 January 2010 through 28 November 2025. After removing missing values, the dataset contains a total of 4150 valid daily records.
Figure 3 presents the daily Brent crude oil price series over the sample period and illustrates the chronological partition between the training and testing sets.
Similar to time series prediction research, the dataset is divided into training set and test set according to time for out-of-sample performance evaluation [
29]. The first 3330 observations (from 4 January 2010 to early October 2022, a total of approximately twelve years) are used for model estimation and parameter calibration. The remaining 820 observations (from 10 October 2022 to 28 November 2025, approximately three years) are used for out-of-sample testing. This rolling forward division makes the data have time series ordering, and there is no information leakage between training and evaluation. Such a setting can evaluate the model prediction ability in a prediction situation closer to the real world. The overall architecture of the proposed ALA-VMD-CASA forecasting framework is illustrated in
Figure 4.
3.2. Benchmark Models and Evaluation Metrics
To evaluate the forecasting performance of the proposed ALA-VMD-CASA framework, we compare it with a set of benchmark models including ARIMA, the random-walk benchmark (RW), the random-walk-with-drift benchmark (RWWD), LSTM, GRU, VMD-GRU, Transformer, and Informer. The inclusion of RW and RWWD is essential because random-walk-type forecasts have become increasingly important benchmark references in the evaluation of financial and commodity price forecasting models [
16,
17]. In the daily Brent crude oil price setting considered here, RW corresponds to the canonical end-of-day no-change forecast, namely, the forecast for time
is the observed price at time
t. RWWD extends this benchmark by allowing for a constant drift term estimated from the training sample. Reporting both RW and RWWD enables us to compare the proposed framework not only against the canonical no-change benchmark for daily financial prices, but also against a closely related drift-adjusted naive benchmark. The selected benchmark set therefore covers traditional statistical methods, random-walk-type naive forecasts, standard deep learning architectures, decomposition-based hybrid methods, and recent attention-based forecasting models, providing a more comprehensive basis for evaluating the effectiveness and robustness of the proposed framework.
All benchmark models are implemented under the same experimental configuration to ensure comparability. The length of the sliding window is 12. The recurrent neural network uses 100 hidden units, the batch size is 64, and it is trained for 100 epochs with a learning rate of 0.0001. The signal decomposition based on VMD is implemented by vmdpy. Under comparable computational configurations, the Transformer and Informer models are trained using only the training set, and the reported evaluation results are calculated based on a separate test set, so that the comparison can reflect the real out-of-sample prediction performance. Among these controlled conditions, the difference in performance is mainly due to the difference in the modeling ability of each method, not the difference in the training procedure, which helps to make the experimental results reliable and reproducible. To prevent information leakage, all models strictly use only historical observations available up to time t when generating forecasts for time ; no contemporaneous or future information is included in the input features at any stage of the forecasting pipeline.
To quantitatively evaluate forecasting accuracy from multiple perspectives, four widely adopted statistical error metrics are employed: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination (
). These metrics are defined as follows:
where
denotes the observed value,
represents the predicted value,
is the mean of the observed values, and
n is the number of samples in the testing set. Lower values of MAE and RMSE indicate smaller prediction errors, while an
value closer to unity implies stronger explanatory power with respect to price variations.
3.3. Results and Discussion
The empirical results include four parts: convergence analysis of the optimization process, comparison of model prediction performance, inspection of the time evolution of prediction errors, and evaluation of statistical significance.
Before comparing performance, first study the convergence characteristics of VMD parameter optimization based on ALA.
The optimal fitness of envelope entropy in
Figure 5 changes during the optimization iteration. The curve drops sharply at first and then quickly stabilizes, which indicates that ALA can quickly reach a good solution and can also maintain a stable trajectory in the subsequent process.
In several iterations, the ALA algorithm is fast and stable, maintaining the balance between exploration and exploitation. This is very important for the number of discrete modes of VMD and the optimization of continuous penalty parameters. A smooth convergence curve means that there is no large oscillation or early stagnation in optimization, which often occurs when meta-heuristic algorithms search in heterogeneous parameter spaces. Since VMD decomposition is sensitive to the parameter set, stable convergence is necessary for consistent decomposition results after multiple runs.
3.3.1. Comparative Performance Analysis
Table 1 presents the one-step-ahead prediction results of Brent crude oil prices by all candidate models. Four commonly used indicators are used to evaluate model performance:
, MSE, RMSE, MAE, and MAPE. It can be seen from the comparison results that there are differences among traditional statistical methods, traditional deep learning architectures, decomposition-based hybrid models and the proposed framework.
First, ARIMA(2,0,2), RW, and RWWD serve as benchmark models. Among them, RW corresponds to the canonical end-of-day no-change forecast for daily financial prices, while RWWD represents a closely related random-walk benchmark with drift [
16,
17]. The
values of these three models are all about 0.967, and the MSE is about 2.24. This result indicates that, in daily crude oil price forecasting, the no-change or random-walk-type benchmark is difficult to surpass, which is consistent with the efficient market view and recent oil forecasting literature [
16,
17,
24]. When directly applied to the original price series, standalone deep learning models are not always superior to the benchmark methods. The
of LSTM and GRU are 0.942 and 0.950 respectively, and the MSE is 3.775 and 3.238, both underperforming RW. This shows that recurrent neural networks have difficulty capturing stable information from fluctuating and non-stationary prices. The attention mechanism narrows the gap. The
obtained by Transformer and Informer are 0.961 and 0.959, which are close to those of the benchmark models, but their MSE values are still slightly higher. These results suggest that without appropriate signal preprocessing, even advanced deep learning models may fail to outperform the canonical random-walk/no-change benchmark [
16,
24]. At the same time, these models directly act on the original price sequence without clearly distinguishing multi-scale fluctuations from noise, which may constrain further performance gains. Importantly, these results show that none of the conventional statistical or deep learning benchmark models consistently outperform the random-walk/no-change benchmark under the present experimental design, highlighting the strength of this benchmark in daily crude oil price forecasting.
Figure 6 shows the absolute prediction errors of all the comparison models during the test period to study the evolution of the errors. When the crude oil price fluctuates, the prediction errors of all methods change.
Pointwise error metrics can describe short-term accuracy, but they cannot fully reflect the long-term cumulative situation of prediction errors. To make up for this shortcoming,
Figure 7 shows the CAE trajectories of all evaluation models in the test samples.
The comparison between
Figure 6 and
Figure 7 highlights the difference between short-term error control and long-term error accumulation. Short-term robustness reflects the model’s ability to handle sudden market fluctuations, while long-term reliability depends on whether small forecast errors accumulate over time. Thus, models that lack structural stability may exhibit considerable error accumulation when evaluated over longer periods of time.
To evaluate whether the observed performance differences are statistically significant differences caused by non-sample variance, a paired DM test is carried out between the ALA-VMD-CASA and the comparison models. The DM test tests the null hypothesis that the prediction accuracies are equal by testing whether the expected value of the loss difference sequence is not equal to zero. This framework is applicable to the comparison of time series prediction models and can handle the serial correlation and heteroscedasticity in prediction errors. In this study, the DM test is carried out under the condition of one-step-ahead prediction, using MAE and MSE as loss functions, and can be robustly evaluated under the standards of absolute error and squared error.
Table 2 shows that the ALA-VMD-CASA is significantly better than all the benchmark models. In all pairwise comparisons, the DM statistic is negative, which indicates that the prediction error of the proposed framework is smaller. The relevant
p-value is less than
, so the null hypothesis of equal prediction accuracy is rejected at the 1% significance level. Due to a large number of pairwise comparisons, we test whether the statistical conclusion is sensitive to the multiple test effect. Holm’s stepwise method and Benjamini–Hochberg FDR correction are adopted. The adjusted conclusions of the two methods are unchanged, which means that the superior performance of ALA-VMD-CASA is not due to multiple comparisons, but the stable improvement of prediction accuracy.
Thirdly, the decomposition–prediction–reconstruction paradigm can separate multi-scale components to improve prediction accuracy, and its advantages vary due to different model combinations. In
Table 1, VMD-GRU is better than GRU. The
increases from 0.950 to 0.959, and MSE decreases from 3.238 to 2.785, showing that VMD reduces the complexity of modeling and also improves the prediction ability of decomposed subsequences. VMD-Informer is slightly better than Informer, and MSE is reduced from 2.676 to 2.632. And VMD-Transformer is not obviously better than Transformer: although the
is almost the same (0.961 vs. 0.961), MSE has a slight increase, changing from 2.522 to 2.593. These results show that decomposition alone cannot improve the generalization ability; effectiveness is related to the decomposition mechanism and the prediction architecture. Finally, the ALA-VMD-CASA framework has the best overall performance, with an
reaching 0.986, and the MSE, RMSE, MAE, and MAPE are reduced to 0.911, 0.954, 0.745, and 0.952, respectively. Compared with the best standalone model Transformer (MSE = 2.522), the proposed framework reduces MSE to 0.911, an improvement of about 63.9%. Compared with the similar VMD–Informer (MSE = 2.632), it further reduces MSE by about 65%. The income comes from VMD multi-scale decomposition, ALA adaptive parameter optimization, as well as CASA local structure pattern and key fluctuation dynamics. From the application perspective, the MAE drops to 0.745, which reduces the average deviation between the predicted price and the observed price, thus providing more reliable quantitative support for crude oil market price prediction, risk management, and trading strategy development.
To further assess the goodness of fit,
Figure 8 depicts the relationship between predicted and observed Brent crude oil prices generated by the ALA-VMD-CASA model, along with the corresponding fitted regression line. The tight clustering of data points around the fitted line demonstrates a strong linear association and a high degree of consistency between model predictions and observations. This pattern indicates that the proposed model does not exhibit systematic overestimation or underestimation across different price levels.
Figure 9 displays the associated residual distribution with respect to the observed values. The residuals are generally centered around zero and do not exhibit obvious systematic patterns, suggesting that prediction errors remain stable across the examined range. The lack of visible structure in the residual series also indicates that the principal nonlinear characteristics of the data have been effectively captured by the model. From the perspective of financial market applications, the substantial reduction in forecasting errors achieved by the ALA-VMD-CASA framework has practical implications. More accurate predictions lead to narrower confidence intervals for price forecasts, more reliable Value-at-Risk assessments, and clearer signals for algorithmic trading strategies. By reducing the uncertainty embedded in the residual series, the framework allows market participants to respond to market changes with greater precision, which is particularly valuable during periods of heightened volatility or structural shifts. This finding carries important implications for the broader literature. Specifically, it calls for caution regarding the common claim that machine learning models uniformly outperform traditional statistical benchmarks, particularly in studies where the random-walk or no-change benchmark is absent or improperly specified. By explicitly incorporating RW and RWWD as benchmarks and adopting a rigorous evaluation design, this study demonstrates that only the ALA-VMD-CASA framework achieves consistent and statistically significant outperformance over these competitive naive baselines.
3.3.2. Ablation Study Analysis
Table 3 presents the results of ablation experiments under different model configurations, which are used to test the contributions of each part of the main framework separately.
The effectiveness of the parameter optimization strategy is first evaluated by comparing ALA-VMD-CASA with variants employing other meta-heuristic optimizers. Under the same decomposition and optimization conditions, the roles of the prediction modules are compared between ALA-VMD-DLinear and ALA-VMD-CASA. After replacing DLinear with CASA, the MSE changes from 2.404 to 0.911, a decrease of 62.1%. DLinear can capture long-term trends, but it is not good at fine price fluctuations. CASA combines convolutional feature extraction and attention mechanism, and is more effective in identifying structural patterns in crude oil price changes and volatility dynamics.
The importance of the decomposition stage is evaluated by comparing the CASA benchmark model with the complete proposed framework. Introducing adaptive decomposition can improve the prediction performance, which shows that the decomposition–prediction–reconstruction framework is effective in energy price prediction. Dividing the highly non-stationary crude oil price series into stable components reduces the complexity of the learning task and also improves the prediction accuracy and generalization ability.
The evidence of comprehensive comparison and ablation experiments indicates the key role of decomposition quality and prediction architecture design in crude oil price prediction. Combining adaptive optimized decomposition with attention-based prediction mechanism can capture multi-scale information, and the proposed hybrid framework has a significant improvement in accuracy and stability, becoming a practical tool for price prediction in the energy market.
Although the preceding tables and figures characterize overall forecasting performance differences among competing models, we cannot completely distinguish the effects of adaptive optimization, signal decomposition, and attention mechanism, nor can we show how these components affect the evolution of prediction errors over time. To explore these situations,
Figure 10 and
Figure 11 respectively present the absolute prediction errors and cumulative absolute errors of each ablation variant. This additional analysis can more clearly reveal the internal mechanism behind the excellent performance of the ALA-VMD-CASA framework.
There are differences in the short-term error behavior of the ablation variants in
Figure 10. The model with adaptive parameter optimization and signal decomposition has a relatively smooth error trajectory, especially when the price fluctuation increases. The model without these components has frequent sudden error peaks, which indicates that it is more sensitive to sudden market disturbances. The PSO-VMD-CASA variant is very unstable, and there are obvious prediction error fluctuations sometimes. This behavior indicates that an inappropriate or unstable parameter set may make the signal decomposition inconsistent, thereby amplifying the short-term prediction error.
The short-term deviation in
Figure 11 accumulates over time. The CAE curve shows that the model without adaptive optimization or an effective attention mechanism accumulates prediction errors relatively fast. The local prediction is inaccurate and not well corrected, and it is passed on in subsequent predictions, making the overall error increase. This pattern reflects that when decomposition is stable or the feature weight mechanism is insufficient, there are limitations in the structure that controls error accumulation. The cumulative error growth is the lowest and the cumulative pattern is more stable when ALA-VMD-CASA is tested. This display framework makes the point-by-point prediction accuracy improved and the overall error generation dynamics reduced. By stabilizing signal decomposition and re-evaluating the variance in the subtraction parameter selection to highlight local feature information, the constructed framework can avoid the gradual accumulation of prediction errors. Therefore, this model has stronger long-term stability and robustness.
3.4. Robustness Analysis
The role of each component in the error dynamics in the pre-ablation experiment. But additional analysis is needed to verify whether the advantages still exist under different experimental conditions. Crude oil prices have obvious characteristics of non-stationarity, state transition and structural breaks. The conclusion obtained from the single training–test split may be affected by sample-specific factors.
To solve the problem, three ways are adopted to test robustness: time robustness is evaluated by rolling with an extended window, as well as performance on multiple predicted horizons, and sensitivity to random initialization. These tests are used to see whether the ALA-VMD-CASA framework can maintain stable and reliable prediction performance under different market conditions, predicted horizons, and random training conditions.
3.4.1. Expanding-Window Walk-Forward Robustness Test
An expanding-window walk-forward evaluation procedure is adopted. The process begins with an initial training sample of length constructed from the earliest observations. The model is then used to generate forecasts for the following h-day horizon. Once the actual observations become available, they are appended to the training set, and the estimation window is moved forward by a step size s. This procedure continues iteratively until the entire testing period has been evaluated.
Extending the forward evaluation of window scrolling is to check the time robustness under market conditions. Different from the fixed partition of training and test sets, this method relies on gradually expanding the training samples by adding new observations. So this evaluation method is more in line with the actual prediction scenario, and the model is updated regularly with new information. The result in
Table 4 is that ALA-VMD-CASA has the lowest average prediction error in the rolling window, and the standard deviation of the error index is much smaller than that of the benchmark model. This reduced variability enables the proposed framework to stably make predictions at different times, not just perform well in specific market periods.
With the expansion of training samples, the performance advantage of ALA-VMD-CASA is relatively obvious, which means that it can absorb new information without obvious performance degradation or overfitting. In order to verify whether this advantage is the same in different evaluation windows,
Table 5 lists the win rate statistics based on MSE. ALA-VMD-CASA has a win rate of 1 in all rolling windows, while no benchmark model has the minimum MSE in a single window.
3.4.2. Multi-Horizon Forecasting Robustness
Evaluating models across different horizons can test robustness because as the horizon becomes longer, uncertainty and errors will increase.
Table 6 shows that the errors of all models increase from 1 step to 10 steps, which reflects the uncertainty of oil price dynamics. The error of ALA-VMD-CASA is still the lowest at each horizon. The advantage of ALA-VMD-CASA becomes stronger as the horizon extends, indicating that it can resist cumulative errors and perform well in short, medium, and long terms.
The strong multi-step forecasting performance of this framework is related to its decomposition–forecasting–reconstruction design. Before forecasting, the original time series is decomposed into different components to reduce the mutual influence between different time scales and avoid the impact of short-term errors on long-term forecasting. Therefore, the model can maintain structural consistency under different forecasting horizons, thus providing more reliable medium- and long-term forecasts in a volatile market.
To further examine the distributional characteristics of prediction errors across forecasting horizons,
Figure 12 presents boxplots of absolute errors for the 1-step, 5-step, and 10-step forecasts. Even so, the error distributions remain relatively compact, indicating that the proposed framework maintains stable predictive behavior even when forecasts extend further into the future.
3.4.3. Robustness to Random Initialization
Given that deep learning training processes and meta-heuristic optimization procedures may introduce stochastic variability, robustness with respect to random initialization is further examined. Under a fixed data partition, multiple independent experiments are conducted using different random seeds, and the mean and standard deviation of each error metric are reported. The corresponding results are presented in
Table 7.
In
Table 7, the performance differences between different seeds of ALA-VMD-CASA are relatively small. For example, the standard deviations of
, RMSE, and MAE are 0.000, 0.006, and 0.005, respectively, which indicates that the prediction performance is highly consistent. ALA-VMD-CASA has an advantage in the average error on the benchmark model. This shows that the improvement in performance is due to the stable model design and reproducibility, rather than relying on random initialization to take a chance.
4. Conclusions
This study explores the challenges in predicting Brent crude oil prices, including large uncertainties, multi-scale fluctuations, nonlinear dynamics, and irregular market fluctuations. The difficulties in overcoming methods to solve these problems lie in the sensitivity of signal decomposition parameters, the complex dynamics of multiple time scales, and the computational burden of long-sequence models. To address this, this study constructs a hybrid prediction framework, integrating ALA, VMD, and a prediction model based on CASA.
Empirical analysis of Brent crude oil spot price data shows that none of the benchmark models—including ARIMA, RW, RWWD, and various deep learning architectures—consistently outperform the random-walk/no-change benchmark. Only the ALA-VMD-CASA framework surpasses RW and RWWD, which strengthens the contribution of this study by demonstrating that the proposed model achieves genuine predictive advantages over both standard statistical and machine learning models and the appropriate random-walk-type naive benchmarks. The of this model is 0.986, and the MSE is 0.911, which is about 63.9% higher than the best single-model benchmark. The results of the ablation experiments show that the main components of the framework are complementary. ALA optimization makes the parameter selection in the decomposition stage tend to be stable. VMD splits the original sequence into multiple components and captures dynamic changes at different scales. The CASA module uses attention and linear components to enhance the model’s ability to identify local structural patterns.
Additional robustness tests confirm that the framework is reliable. There are cases of forward validation using rolling windows, using perfect winning rates, conducting multi-step forecasts (including 1-step, 5-step, 10-step), and repeating experiments with different random seeds. In various tests under different market conditions and prediction horizons, the model has always maintained stable and excellent prediction performance.
In the energy economy, more accurate forecasting and more stable error performance can make the model more effective in pricing in the changing crude oil market. Reliable forecasting can assist in carrying out risk management, investment decision-making, policy analysis and such affairs. Although the research results are quite exciting, there are still many directions for future research to explore. Perhaps what can be expanded is to apply this framework to high-frequency data, incorporate macroeconomic and geopolitical factors to make the economic explanation more powerful, and develop more computationally efficient optimization methods to promote large-scale pricing applications.