1. Introduction
Crude oil prices serve as a crucial indicator of energy security and the macroeconomy. Long-term price dynamics reflect shifts in supply and demand, geopolitical developments, financial cycles, and the energy transition [1]. The interplay among these factors renders the oil-price series complex, nonlinear, and multi-scale [2]. Consequently, this study aims to develop a forecasting model, assessed on an extensive dataset, that achieves high predictive accuracy and robust out-of-sample performance amid multiple simultaneous sources of uncertainty [3]. Enhanced forecasts, evaluated over a lengthy historical period, can assist governments in optimizing energy security and macroeconomic policies, enable energy firms and financial institutions to hedge and manage risk, and provide more reliable inputs for planning related to the green transition and emission reduction.
Most studies adopt one of two primary strategies. The first employs linear statistical models, such as the AutoRegressive Integrated Moving Average (ARIMA) and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models, which assume quasi-stationarity and use simple structures to characterize trends and volatility [4]. While these models are interpretable and straightforward to implement, they perform poorly in the presence of structural breaks and pronounced nonstationarity, leading to accumulated bias in noisy, multiscale environments [5]. The second strategy involves nonlinear AI models, including Back Propagation (BP) networks, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) networks, which leverage flexible function classes to capture complex dynamics and typically achieve lower in-sample error. However, their large parameter spaces heighten sensitivity to noise and scale mixing, thereby compromising out-of-sample stability [6]. Recent findings indicate that forecast stability, defined as the sensitivity of forecasts to minor changes in the information set or estimation window, is a criterion distinct from average accuracy, underscoring the need for explicit dispersion and stability objectives in rolling-origin evaluation [7]. A prevalent compromise integrates decomposition techniques, such as Empirical Mode Decomposition (EMD), Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), wavelets, and Variational Mode Decomposition (VMD), with these forecasting models in a decomposition–prediction–reconstruction pipeline [8]. In many studies, however, reconstruction still relies on fixed or heuristic fusion rules, model selection is based on single-objective criteria, and Pareto-front maintenance is not explicitly addressed, which collectively limits robustness and reproducibility.
To overcome the limitations in signal reconstruction, model search, and Pareto-front maintenance, we reorganize three core components of the decomposition–ensemble workflow. First, at the objective level, ensemble-weight selection is formulated as a bi-objective optimization problem that minimizes both average forecast error and error dispersion under a leakage-free rolling-origin protocol, thereby establishing a robust mathematical foundation for subsequent design. Second, at the reconstruction level, we define a VMD-based operator that adaptively reconstructs a raw time series by integrating intrinsic mode functions with weights derived from entropy and center frequency, while adhering to non-negativity and normalization constraints. Third, at the evolutionary-search level, we propose an elite-guided Crisscross Optimization method that employs tailored horizontal and vertical crossovers on weight vectors, along with a hybrid archive criterion that combines hypervolume (HV) [9] with distance to an ideal point, to maintain a well-distributed Pareto front. All components are presented in a fully explicit form. The main contributions of this paper are summarized as follows.
An adaptive VMD-based reconstruction operator is proposed, which maps a normalized input series to a reconstructed signal by integrating Intrinsic Mode Functions (IMFs). This integration employs weights derived from entropy-based relevance scores and normalized center frequencies, adhering to non-negativity and unit-sum constraints. Each IMF is characterized by a brief vector of statistical indicators, which includes the absolute Pearson and Spearman correlations with the target, the maximal information coefficient, and the energy ratio. These indicator vectors undergo min-max normalization across IMFs and are combined using entropy weighting to generate relevance scores. Subsequently, the relevance scores are refined through a frequency penalty and normalized to produce the final fusion weights. This operator is model-agnostic and can function as a generic pre-processing module within other decomposition-ensemble frameworks.
Elite-guided horizontal and vertical crossover operators for ensemble weighting are introduced. These two crossover operators directly manipulate simplex-constrained ensemble-weight vectors. The horizontal operator generates convex combinations of the current solution, a peer, and a Pareto-elite individual. In contrast, the vertical operator updates a limited subset of weight dimensions to enhance promising patterns. Both operators incorporate annealed Gaussian perturbations, ensure feasibility through non-negativity enforcement and unit-sum normalization, and promote information exchange between elite and non-elite regions of the population throughout the search process.
Hybrid hypervolume–ideal-distance archive rule. A novel hybrid archive maintenance rule is introduced that concurrently considers HV to enhance coverage and diversity, as well as the distance to the ideal point to improve convergence. The resulting archive provides a well-distributed, high-quality approximation of the Pareto front. This update integrates a fixed-reference-point strategy and explicit termination criteria to ensure stable selection and reproducible optimization.
The rest of this work is arranged as follows. In Section 2, the methodological foundations relevant to this study are reviewed, including decomposition–ensemble forecasting pipelines, ensemble-weight learning and forecast combination, and multi-objective evolutionary optimization with Pareto-front maintenance. In Section 3, the proposed methodology is described in detail, covering the notation and problem formulation, the entropy–frequency VMD reconstruction operator, and the Multi-Objective Enhanced Crisscross Optimization (MOECSO) algorithm with elite-guided horizontal and vertical crossover operators and an archive update scheme based on a hybrid elite criterion. In Section 4, the experimental design is specified, including the Brent dataset, the leakage-free rolling-origin protocol, evaluation metrics, statistical tests, competing baselines, and the definition of stress-period sub-samples. In Section 6, the empirical results are presented and discussed, with emphasis on full-sample performance, robustness under market stress, and supporting statistical evidence. In Section 7, the main findings are summarized, and directions for future research are outlined.
3. Methods
3.1. Notation and Problem Setting
Let $\{y_t\}$ denote a univariate time series of daily Brent crude oil spot prices in USD. One-step-ahead forecasting is evaluated under a rolling-origin walk-forward protocol. For each origin $o$, a rolling window of length $W$ ending at the origin is used; the last $V$ observations of the window form the inner validation set, while the standardization statistics are computed on the preceding fit segment only, so that the validation fold is excluded from fitting-time normalization. Forecasts on the validation fold are inverse-transformed to USD before residual evaluation.
Let $\hat{y}_{t+1}^{(m)}$ be the one-step-ahead forecast of base model $m$ using information up to time $t$, for $m = 1, \dots, M$. Collect the base forecasts as $\hat{\mathbf{y}}_{t+1} = (\hat{y}_{t+1}^{(1)}, \dots, \hat{y}_{t+1}^{(M)})^{\top}$. Ensemble fusion is conducted on the probability simplex $\Delta^{M-1} = \{\mathbf{w} \in \mathbb{R}^{M} : w_m \ge 0, \; \sum_{m=1}^{M} w_m = 1\}$. For any $\mathbf{w} \in \Delta^{M-1}$ and $t$ in the inner validation set $\mathcal{V}_o$, with horizon $h = 1$ and origin $o$, define
$$\hat{y}_{t+1}(\mathbf{w}) = \sum_{m=1}^{M} w_m \, \hat{y}_{t+1}^{(m)}, \qquad e_t(\mathbf{w}) = y_{t+1} - \hat{y}_{t+1}(\mathbf{w}).$$
Two objectives are jointly minimized on $\Delta^{M-1}$ using residuals on the USD scale:
$$f_1(\mathbf{w}) = \frac{1}{|\mathcal{V}_o|} \sum_{t \in \mathcal{V}_o} |e_t(\mathbf{w})|, \qquad f_2(\mathbf{w}) = \sqrt{\frac{1}{|\mathcal{V}_o| - 1} \sum_{t \in \mathcal{V}_o} \big(e_t(\mathbf{w}) - \bar{e}(\mathbf{w})\big)^2},$$
where $\bar{e}(\mathbf{w})$ denotes the mean validation residual. Here, $f_1$ is the MAE and $f_2$ (SSDVR) is the sample standard deviation of validation residuals, both in USD.
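As a concrete illustration of these two objectives, the following minimal Python sketch (not the authors' code; array names such as `base_preds` are placeholders) evaluates the MAE and the residual standard deviation for a candidate weight vector on a validation fold.

```python
# Minimal sketch of the two validation objectives: `base_preds` is a (V, M) array
# of one-step-ahead base forecasts in USD, `y_val` the realized prices, and `w` a
# simplex weight vector.
import numpy as np

def bi_objective(w, base_preds, y_val):
    """Return (MAE, residual std) for a convex-combination ensemble."""
    ens = base_preds @ w                  # ensemble forecast per validation point
    resid = y_val - ens                   # residuals on the USD scale
    f1 = np.mean(np.abs(resid))           # objective 1: MAE
    f2 = np.std(resid, ddof=1)            # objective 2: sample std of residuals (SSDVR)
    return f1, f2
```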
Proposition 1. If $|\mathcal{V}_o|$ is finite, then the bi-objective minimization of $(f_1, f_2)$ over $\Delta^{M-1}$ admits at least one Pareto-optimal solution; hence the Pareto set is non-empty.
Proof. The feasible set $\Delta^{M-1}$ is non-empty, closed, and bounded, hence compact in $\mathbb{R}^{M}$. For finite $|\mathcal{V}_o|$, each $\hat{y}_{t+1}(\mathbf{w})$ is affine in $\mathbf{w}$, and both $f_1$ and $f_2$ are continuous on $\Delta^{M-1}$. A standard existence result for Pareto minimizers of continuous vector objectives over compact feasible sets implies that at least one Pareto-optimal solution exists [22]. □
Proposition 2. For any $\mathbf{w} \in \Delta^{M-1}$, $\hat{y}_{t+1}(\mathbf{w})$ is a convex combination of $\hat{y}_{t+1}^{(1)}, \dots, \hat{y}_{t+1}^{(M)}$. Moreover, the absolute error satisfies
$$|e_t(\mathbf{w})| \le \sum_{m=1}^{M} w_m \, \big|y_{t+1} - \hat{y}_{t+1}^{(m)}\big|.$$
Averaging over $\mathcal{V}_o$ yields $f_1(\mathbf{w}) \le \sum_{m=1}^{M} w_m \, \mathrm{MAE}_m$, where $\mathrm{MAE}_m$ is the validation MAE of base model $m$. In contrast, MAPE involves a time-varying normalization term and therefore does not admit an analogous linear upper bound; instead, it is evaluated using the stabilized denominator $\max(|y_{t+1}|, \epsilon)$.
3.2. Overall Framework
Figure 1 presents a three-stage forecasting pipeline assessed using a leakage-free rolling-origin protocol. For each origin, all preprocessing, decomposition, scoring, and model fitting occur exclusively within the training window, while the held-out fold is designated solely for out-of-sample evaluation.
Data processing and reconstruction
The Brent price series is denoted by $\{y_t\}$. The training window is standardized, and VMD is applied to derive $K$ IMFs, ordered by their center frequencies. Innovation #1 then introduces an entropy–frequency reconstruction operator, which assigns a relevance score to each IMF through a multi-indicator assessment that combines entropy-based aggregation with a frequency regularization term. The resulting nonnegative weights are normalized and used to reconstruct a single input signal in the standardized domain.
Heterogeneous base predictors
After reconstructing the input series, a diverse set of base models is trained. Feature augmentation, such as differencing or returns, is selectively applied to certain learning-based models, as illustrated by dashed links in Figure 1, while ARIMA is fitted without augmentation. Each base learner generates one-step-ahead forecasts at every origin, collectively forming a forecast vector.
Bi-objective simplex ensemble via MOECSO
The final forecast is derived by optimizing the convex-combination weights across the base forecasts. The validation objectives include (1) MAE and (2) SSDVR. The proposed MOECSO solver consists of two primary components. The first component is an elite-guided horizontal and vertical crossover operator (Innovation #2), while the second component is an external-archive maintenance mechanism that utilizes a hybrid elite criterion (Innovation #3). The update process is executed through non-dominated sorting, followed by scoring, filling, and truncation. The algorithm generates a Pareto archive of candidate weight vectors, from which a single operational solution is chosen to produce the ensemble forecast.
3.3. Adaptive VMD Reconstruction Based on Multi-Indicator Fusion
This subsection formalizes the adaptive VMD reconstruction used to generate the input series for all the base models.
Figure 2 illustrates the overall entropy–frequency adaptive reconstruction pipeline. VMD is applied to the standardized input series $\tilde{y}_t$ to extract $K$ band-limited components $u_1(t), \dots, u_K(t)$, ordered by their center frequencies $\omega_1 \le \dots \le \omega_K$ from low to high. For each IMF, a compact set of relevance indicators is computed and rescaled across $k$. These indicators are then fused via entropy-based weighting to obtain a per-IMF relevance score $s_k$. To reduce the contribution of noisy high-frequency modes, a frequency-aware penalty $p_k$ is further introduced based on the normalized center frequency $\tilde{\omega}_k$. The preliminary score is defined as $\tilde{s}_k = s_k \, p_k$. The scores are clipped at zero and normalized to sum to one, yielding fusion weights $v_k$ [23]. The reconstructed signal used in the subsequent stages is the resulting convex combination in Equation (6):
$$\tilde{x}_t = \sum_{k=1}^{K} v_k \, u_k(t). \qquad (6)$$
3.3.1. VMD Decomposition
VMD is applied to the standardized input series $\tilde{y}_t$ to extract $K$ band-limited components $u_1(t), \dots, u_K(t)$ such that the original series can be represented by their superposition:
$$\tilde{y}_t \approx \sum_{k=1}^{K} u_k(t), \qquad (7)$$
where $u_k(t)$ denotes the $k$-th mode at time $t$ and $K$ is the number of modes. The components are ordered from low to high according to their center frequencies $\omega_k$. Following the decomposition in Equation (7), a compact set of relevance indicators is computed for each IMF and rescaled across $k$ before fusion.
3.3.2. Indicators
Four complementary relevance indicators are computed for each mode with respect to the standardized reference series $\tilde{y}_t$:
Pearson correlation:
$$\rho_k = \frac{\operatorname{Cov}(u_k, \tilde{y})}{\sigma_{u_k} \, \sigma_{\tilde{y}}}, \qquad (8)$$
where the sample covariance $\operatorname{Cov}(u_k, \tilde{y})$ and the sample standard deviations $\sigma_{u_k}$ and $\sigma_{\tilde{y}}$ in Equation (8) are computed over the fit segment.
Spearman correlation:
$$\rho^{S}_k = \rho\big(\operatorname{rank}(u_k), \operatorname{rank}(\tilde{y})\big), \qquad (9)$$
where the Pearson correlation in Equation (9) is applied to the ranked samples. Ties are handled by standard average ranking in practice.
Maximum Information Coefficient (MIC) [24]: $\mathrm{MIC}_k$ in Equation (10) is computed from the empirical mutual information between $u_k$ and $\tilde{y}$, normalized by the empirical entropy of the discretized samples. MIC captures potentially nonlinear dependence; a mild upward bias may occur when the sample size is small.
Energy ratio [25]:
$$E_k = \frac{\sum_{t} u_k(t)^2}{\sum_{j=1}^{K} \sum_{t} u_j(t)^2}, \qquad (11)$$
where the energy ratio $E_k$ in Equation (11) measures the relative energy contribution of $u_k$.
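For illustration, the sketch below computes three of the four indicators per IMF with standard NumPy/SciPy routines; MIC is omitted because it requires a dedicated grid-based estimator. Function and variable names are placeholders, not the paper's.

```python
# Illustrative computation (assumed, not verbatim from the paper) of three of the
# four per-IMF indicators; MIC would need a separate estimator.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def imf_indicators(imfs, y_ref):
    """imfs: (K, N) array of modes; y_ref: standardized reference series of length N."""
    total_energy = np.sum(imfs ** 2)
    rows = []
    for u in imfs:
        pearson = abs(pearsonr(u, y_ref)[0])       # absolute Pearson correlation
        spearman = abs(spearmanr(u, y_ref)[0])     # absolute Spearman (rank) correlation
        energy = np.sum(u ** 2) / total_energy     # energy ratio of this mode
        rows.append((pearson, spearman, energy))
    return np.array(rows)                          # shape (K, 3)
```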
3.3.3. Multi-Indicator Fusion via Entropy Weighting
To treat positive and negative dependencies symmetrically, absolute correlations are used, and the four raw indicators for each mode are $|\rho_k|$, $|\rho^{S}_k|$, $\mathrm{MIC}_k$, and $E_k$. Because these indicators have different scales, each indicator is min–max normalized across $k$ to the unit interval before fusion. Indicators that are constant across IMFs are excluded from weighting to avoid degenerate normalization.
Entropy-based weights are then computed to emphasize indicators that provide higher cross-IMF discrimination [26]. For each indicator $j$, a probability mass function over IMFs is formed in Equation (12) as
$$p_{jk} = \frac{\hat{I}_{jk}}{\sum_{k'=1}^{K} \hat{I}_{jk'}}, \qquad (12)$$
where $\hat{I}_{jk}$ denotes the min–max normalized value of indicator $j$ for mode $k$. The entropy and the corresponding entropy weight are defined by
$$H_j = -\frac{1}{\ln K} \sum_{k=1}^{K} p_{jk} \ln p_{jk}, \qquad \beta_j = \frac{1 - H_j}{\sum_{j' \in \mathcal{J}} (1 - H_{j'})}, \qquad (13)$$
where $\mathcal{J}$ denotes the set of non-constant indicators after normalization in Equation (13). The fused relevance score of $u_k$ is finally computed as
$$s_k = \sum_{j \in \mathcal{J}} \beta_j \, \hat{I}_{jk}, \qquad (14)$$
where, if $\mathcal{J}$ is empty, $s_k$ in Equation (14) is set to $1/K$ to allocate relevance uniformly across IMFs.
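The entropy-weighting step can be sketched as follows, assuming the standard entropy-weight construction described above; the epsilon guard and the uniform fallback for constant indicators are implementation details of this sketch rather than the paper's code.

```python
# Hedged sketch of entropy-based indicator fusion across IMFs.
import numpy as np

def entropy_fuse(indicators, eps=1e-12):
    """indicators: (K, J) matrix of per-IMF indicator values, already >= 0."""
    # Min-max normalize each indicator across IMFs; drop constant columns.
    lo, hi = indicators.min(axis=0), indicators.max(axis=0)
    keep = hi - lo > eps
    if not np.any(keep):
        return np.full(indicators.shape[0], 1.0 / indicators.shape[0])
    z = (indicators[:, keep] - lo[keep]) / (hi[keep] - lo[keep])
    # Probability mass per indicator over IMFs, then normalized Shannon entropy.
    p = (z + eps) / (z + eps).sum(axis=0)
    H = -(p * np.log(p)).sum(axis=0) / np.log(z.shape[0])
    beta = (1.0 - H) / (1.0 - H).sum()             # entropy weights over indicators
    return z @ beta                                 # fused relevance score per IMF
```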
3.3.4. Frequency-Aware Scoring and Reconstruction
To suppress noisy high-frequency modes while retaining informative components, we combine the relevance score $s_k$ with a frequency-aware penalty derived from the center frequency of each $u_k$. The center frequency is estimated in practice by the spectral centroid of the one-sided power spectrum:
$$\omega_k = \frac{\sum_{f} f \, |U_k(f)|^2}{\sum_{f} |U_k(f)|^2}, \qquad (15)$$
where $U_k(f)$ in Equation (15) is obtained by FFT, and mirror padding can be applied to mitigate boundary effects.
The frequencies are normalized across modes within each rolling origin so that the normalized values $\tilde{\omega}_k$ lie in $[0, 1]$, and a monotonically decreasing exponential penalty is used. The relevance–frequency fusion score in Equation (16) is then defined by
$$\tilde{s}_k = s_k \, \exp(-\lambda \tilde{\omega}_k), \qquad (16)$$
with the smallest-frequency mode exempted by setting its penalty to one. Finally, $\tilde{s}_k$ is truncated at zero and normalized to obtain nonnegative reconstruction weights $v_k$ with $\sum_{k=1}^{K} v_k = 1$, and the reconstructed series in Equation (17) is
$$\tilde{x}_t = \sum_{k=1}^{K} v_k \, u_k(t). \qquad (17)$$
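A compact sketch of the frequency-aware weighting and reconstruction is given below; the spectral-centroid estimate, the exponential penalty, and the default `lam` value are assumptions consistent with the description above rather than the authors' exact settings.

```python
# Sketch of frequency-aware weighting and reconstruction; `scores` are the fused
# relevance scores s_k and `lam` plays the role of the penalty strength.
import numpy as np

def reconstruct(imfs, scores, lam=1.0):
    K, N = imfs.shape
    # Spectral centroid of each mode's one-sided power spectrum.
    freqs = np.fft.rfftfreq(N)
    power = np.abs(np.fft.rfft(imfs, axis=1)) ** 2
    centroid = (power * freqs).sum(axis=1) / power.sum(axis=1)
    # Normalize centroids to [0, 1] and apply a decreasing exponential penalty.
    w = (centroid - centroid.min()) / (centroid.max() - centroid.min() + 1e-12)
    penalty = np.exp(-lam * w)
    penalty[np.argmin(centroid)] = 1.0             # lowest-frequency mode exempted
    raw = np.clip(scores * penalty, 0.0, None)     # truncate at zero
    v = raw / raw.sum()                            # nonnegative, unit-sum fusion weights
    return v @ imfs                                # reconstructed series of length N
```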
3.3.5. Complexity and Implementation
All quantities used in the reconstruction, including the indicator scores and the frequency-based penalty, are computed on the fit segment at each rolling origin to avoid any leakage into the inner validation fold. The penalty strength $\lambda$ is held fixed across all rolling origins, and the test fold is strictly excluded from any hyperparameter selection.
In terms of computational cost, Pearson and Spearman correlations and the energy ratio scale linearly with the segment length, yielding $O(KN)$ per rolling origin for $K$ modes and $N$ samples. Computing MIC for all modes is asymptotically more expensive but remains empirically tractable in our implementation. For the Brent experiments, the per-origin training window length is fixed at $W$, making MIC tractable relative to the FFT-based operations required by the VMD-related steps.
Definition 1. The entropy–frequency VMD reconstruction operator is defined as
$$\mathcal{R} : \tilde{y} \;\mapsto\; \tilde{x} = \sum_{k=1}^{K} v_k \, u_k, \qquad (18)$$
where the mapping in Equation (18) is implemented by first decomposing the standardized input series $\tilde{y}$ into VMD components $u_1, \dots, u_K$, each accompanied by a characteristic center frequency. Each IMF is associated with a relevance score $s_k$ and a frequency penalty $p_k$. To reconstruct the series $\tilde{x}$, the procedure computes $s_k$, $\tilde{\omega}_k$, and $p_k$, merges the relevance and frequency information into preliminary scores, transforms them into positive unit-sum weights by clipping and renormalization, and finally forms the reconstruction as the weighted sum of the VMD components.

Proposition 3. The reconstruction induced by $\mathcal{R}$ under simplex normalization is feasible and continuous. Given VMD outputs $u_1, \dots, u_K$ with preliminary scores $\tilde{s}_k$, define the fusion weights by Equation (19) with $\varepsilon > 0$:
$$v_k = \frac{\max(\tilde{s}_k, 0) + \varepsilon}{\sum_{j=1}^{K} \big(\max(\tilde{s}_j, 0) + \varepsilon\big)}, \qquad (19)$$
where $v_k \ge 0$ for all $k$. Moreover, the weights satisfy the unit-sum constraint in Equation (20), and, therefore, $\mathbf{v} \in \Delta^{K-1}$:
$$\sum_{k=1}^{K} v_k = 1. \qquad (20)$$
Furthermore, the mapping $\tilde{\mathbf{s}} \mapsto \mathbf{v}$ is continuous because it is a composition of continuous operations and its denominator is bounded away from zero by $K\varepsilon$. Consequently, the reconstruction in Equation (21) varies continuously with respect to $\mathbf{v}$:
$$\tilde{x}_t = \sum_{k=1}^{K} v_k \, u_k(t). \qquad (21)$$

Remark 1. The operator $\mathcal{R}$ can be interpreted as a data-driven low-pass reconstruction: it acts as a filter that preserves a blend of VMD components without imposing hard frequency cut-offs or relying on a single indicator. The weights are determined jointly by statistical relevance, through the entropy-based scores, and by the normalized frequency $\tilde{\omega}_k$. Consequently, high-frequency components are de-emphasized only when they are weakly related to the target, which yields more reliable reconstructions in the presence of noisy high-frequency content. The operator is model-agnostic and can serve as a generic pre-processing step in other decomposition-ensemble frameworks.
3.4. Bi-Objective Ensemble-Weight Optimization via MOECSO
At each rolling origin $o$, ensemble-weight selection is formulated as a bi-objective optimization problem over the simplex $\Delta^{M-1}$. Over the inner validation set $\mathcal{V}_o$, the ensemble forecast is obtained as a convex combination of the base one-step-ahead forecasts, and the validation residual is defined as the difference between the observed price and the ensemble forecast on the original USD scale. The two objectives jointly minimize (1) MAE and (2) SSDVR, thereby balancing accuracy and stability.
MOECSO updates candidate weights on the simplex through two elite-guided crossover operators. The horizontal crossover facilitates whole-vector recombination, promoting extensive exploration of $\Delta^{M-1}$. In contrast, the vertical crossover modifies only a small subset of coordinates, enabling fine-grained local refinement around promising solutions. Objective evaluation follows the notation and problem setting defined above, and standardization is applied solely for base-model fitting within each origin.
To approximate the Pareto front, an external archive is maintained and updated using a hybrid criterion that balances diversity and convergence, measured by HV and the distance to the ideal point, respectively. An annealed schedule emphasizes HV in early iterations to enhance coverage and reduce premature convergence, and gradually shifts toward the ideal-point distance in later iterations to strengthen convergence. In terms of computational cost, both crossover and simplex repair incur an $O(M)$ cost per individual, leading to an $O(NM)$ cost per generation for a population size of $N$.
3.4.1. Horizontal Crossover with Elite Guidance
The new candidate is formulated as an elite-guided linear recombination of the current solution, a randomly selected peer, and a Pareto-elite individual, augmented by Gaussian noise. When the elite-guidance coefficient is zero, the operator reduces to a standard horizontal crossover [27], which exclusively mixes information between the incumbent $w$ and its peer. Introducing a positive coefficient provides explicit elite guidance, directing each update toward solutions along the current Pareto front while maintaining contributions from both the incumbent and the peer. This elite-guided crossover enhances convergence to high-performing regions of the weight space, mitigates the risk of drifting away from the front when objectives are noisy, and sustains diversity through the inclusion of peer and noise components. A schematic illustration of the elite-guided crossover operators is provided in Figure 3. The horizontal update is defined in Equation (22), where the mixing coefficient regulates horizontal recombination and is a separate parameter, independent of the VMD frequency-penalty parameter $\lambda$.
Meanwhile, to balance exploration and exploitation, a progress variable $\tau_g$ is introduced, which increases linearly with the generation counter:
$$\tau_g = \frac{g}{G}, \qquad (23)$$
where $g$ in Equation (23) represents the generation index and $G$ the maximum number of generations. The elite-guidance strength grows with $\tau_g$ up to a prescribed maximum, while the Gaussian-noise magnitude decays from its initial level as $\tau_g$ increases. Thus, the noise is higher and elite guidance is weaker in the early iterations, whereas in later iterations the noise decreases gradually and elite guidance strengthens.
As a safeguard, a mild elementwise clipping step truncates each component of the candidate vector to a bounded interval before enforcing non-negativity and unit-sum normalization, which mitigates rare numerical spikes when $M$ is large or when the previous weights are extreme, while remaining loose enough to leave typical perturbations essentially unaffected. The horizontal update performs full-vector recombination and applies the same simplex-repair procedure via non-negativity enforcement and unit-sum normalization, yielding a computational complexity of $O(M)$ per individual and $O(NM)$ per generation.
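The following hedged sketch illustrates one plausible form of the elite-guided horizontal move and the shared clipping-and-repair step; it is not the paper's exact Equation (22), and the clipping bounds and coefficient names are assumptions.

```python
# Illustrative elite-guided horizontal move: a mix of the incumbent w, a peer, and
# a Pareto-elite e, plus annealed Gaussian noise, followed by clipping and repair.
import numpy as np

def simplex_repair(w, clip_lo=-1.0, clip_hi=2.0):
    w = np.clip(w, clip_lo, clip_hi)               # mild clipping against rare spikes
    w = np.maximum(w, 0.0)                         # non-negativity
    return w / w.sum() if w.sum() > 0 else np.full_like(w, 1.0 / w.size)

def horizontal_cross(w, peer, elite, phi, sigma, rng):
    r = rng.random(w.size)                         # per-coordinate mixing coefficients
    child = r * w + (1.0 - r) * peer               # standard horizontal recombination
    child += phi * (elite - child)                 # elite guidance term
    child += sigma * rng.standard_normal(w.size)   # annealed Gaussian perturbation
    return simplex_repair(child)
```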
3.4.2. Vertical Crossover with Elite Guidance
Complementing the horizontal crossover that promotes global exploration, the vertical crossover [27] performs localized refinement by modifying only a small subset of dimensions while preserving the overall structure of the current solution. This is particularly effective in later iterations, when subtle weight adjustments can escape shallow local optima more reliably than large global moves (see Figure 3). Let $w$ denote the current solution. Two distinct indices are uniformly sampled from $\{1, \dots, M\}$ without replacement, and an elite $e$ is uniformly drawn from the Pareto front excluding $w$. With a random mixing coefficient $r$ and an annealed noise term, only the selected dimensions are updated as follows in Equation (24):
For all other indices the components are left unchanged, so each move perturbs only two coordinates. The coefficient $r$ induces a swap-like linear recombination between the selected components, the elite-guidance term pulls them toward the elite solution, and the annealed noise term improves robustness against premature convergence. To mitigate rare numerical spikes that may occur when $M$ is large or when the previous weights are extreme, the candidate is first clipped elementwise to a bounded interval and then repaired to satisfy feasibility by enforcing non-negativity followed by unit-sum normalization. Although the raw update touches only a constant number of coordinates, the simplex-repair step requires $O(M)$ time per individual, leading to an $O(NM)$ cost per generation for a population of size $N$.
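A similarly hedged sketch of the vertical move is shown below; it updates only two coordinates and reuses the `simplex_repair` helper from the previous sketch. Again, it illustrates the described behavior rather than reproducing Equation (24) verbatim.

```python
# Illustrative vertical move: two coordinates i, j are recombined swap-style,
# pulled toward the elite, perturbed, and then the candidate is repaired.
import numpy as np

def vertical_cross(w, elite, phi, sigma, rng):
    child = w.copy()
    i, j = rng.choice(w.size, size=2, replace=False)
    r = rng.random()
    child[i] = r * w[i] + (1.0 - r) * w[j] + phi * (elite[i] - w[i]) + sigma * rng.standard_normal()
    child[j] = r * w[j] + (1.0 - r) * w[i] + phi * (elite[j] - w[j]) + sigma * rng.standard_normal()
    return simplex_repair(child)                   # same clipping and repair as before
```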
3.4.3. Archive Update and Selection
This subsection describes how the Pareto archive is refreshed at each generation to retain a compact, non-redundant set of competitive solutions. At each generation, the previous archive $A$ is merged with the offspring set $O$ produced by the horizontal and vertical operators to form the candidate set $C = A \cup O$. To remove redundancy, near-identical duplicates under floating-point tolerance are screened in the objective space using an infinity-norm tolerance: letting $\mathbf{f}(w) = (f_1(w), f_2(w))$ denote the bi-objective vector, if $\|\mathbf{f}(w) - \mathbf{f}(w')\|_\infty \le \delta$ for a small default tolerance $\delta$, only one representative is retained.
The core of the archive update is the hybrid elite criterion, which ranks candidates by combining two measures: HV to promote diversity, and distance-to-ideal to promote convergence.
Figure 4 provides a schematic illustration of this hypervolume–ideal-distance hybrid elitism and the role of the adaptive mixing weight $\alpha_g$. For each objective $i \in \{1, 2\}$, the HV reference point and the ideal point with safety margins are defined as $r_i = \max_{w \in C} f_i(w) + \eta_i$ and $z_i = \min_{w \in C} f_i(w) - \eta_i$, where the safety margins $\eta_i$ are set to a common fixed value. Let $\mathbf{r} = (r_1, r_2)$ and $\mathbf{z} = (z_1, z_2)$. Given the candidate set $C$, the HV of a candidate $w$ in Equation (25) is measured with respect to the reference point $\mathbf{r}$, which is chosen to be dominated by all candidates in $C$. To make the distance comparable across the objectives, each objective is normalized by its scale relative to the ideal point: define $\kappa_i = r_i - z_i$ and use $\max(\kappa_i, \epsilon)$ with a small $\epsilon > 0$ to prevent division by zero. The distance from $w$ to the ideal point $\mathbf{z}$ in Equation (26) is
$$d(w) = \sqrt{\sum_{i=1}^{2} \left( \frac{f_i(w) - z_i}{\max(\kappa_i, \epsilon)} \right)^{2}}. \qquad (26)$$
At generation $g$, each candidate $w$ receives the archive selection score defined in Equation (27), which combines the HV term and the normalized ideal-point distance through the mixing weight $\alpha_g$; candidates with larger scores are preferred. Ties are broken using crowding distance as in NSGA-II [28]. The mixing weight $\alpha_g$ decreases with the generation counter, so the score emphasizes HV-based diversity when $\alpha_g$ is large and gradually shifts the emphasis toward convergence to the ideal point as $\alpha_g$ becomes smaller.
Archive filling and truncation are performed sequentially over the non-dominated fronts $F_1, F_2, \dots$ obtained from $C$. The new archive $A'$ is initialized as empty, and fronts are processed in increasing rank: if $|A'| + |F_i|$ does not exceed the archive capacity, all solutions in $F_i$ are appended to $A'$; otherwise, solutions in $F_i$ are sorted in descending order of the selection score, the top-scoring solutions are appended to $A'$ until the capacity is reached, and the procedure terminates. Finally, the archive is updated by setting $A \leftarrow A'$.
For reproducibility, all scores are computed using the shared reference point $\mathbf{r}$ and the common ideal point $\mathbf{z}$ determined from $C$, and the schedule parameters are fixed throughout the experiments, making the scoring function fully determined at each generation. In the bi-objective case, non-dominated filtering for the first front can be implemented via sorting in $O(n \log n)$ time, and HV evaluation can likewise be carried out in $O(n \log n)$ time using sorting-based routines, while the normalized distance-to-ideal computation is linear in the number of candidates. Overall, this design maintains both computational efficiency and reproducibility of the archive update process.
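In the bi-objective minimization case, the hypervolume of a non-dominated set with respect to a dominated reference point reduces to a sum of rectangles after sorting, as in the sketch below; per-candidate HV contributions can then be obtained by leave-one-out differences.

```python
# Self-contained sketch of 2-D hypervolume (both objectives minimized) with a
# reference point `ref` dominated by all candidates.
import numpy as np

def hypervolume_2d(points, ref):
    """points: (n, 2) objective vectors of a non-dominated set; ref: (2,)."""
    pts = points[np.argsort(points[:, 0])]         # sort by the first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                           # skip dominated points defensively
            hv += (ref[0] - f1) * (prev_f2 - f2)   # add the new rectangle
            prev_f2 = f2
    return hv
```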
3.5. Overall Evolutionary Process and Termination
Algorithm 1 outlines the MOECSO optimization loop. Beginning with an initial feasible population, candidate weight vectors are iteratively refined through the elite-guided horizontal and vertical crossover operators. Following each update, candidates undergo a repair process to ensure compliance with non-negativity and unit-sum constraints, after which they are evaluated on the validation set using the bi-objective functions $f_1$ (MAE) and $f_2$ (SSDVR). Archive maintenance and selection are conducted according to the hybrid elite criterion. The process terminates when the maximum number of generations is reached, and the final archive is returned as the Pareto set.
Algorithm 1: MOECSO (pseudocode figure).
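The sketch below paraphrases this main loop at a high level, reusing the `bi_objective`, `horizontal_cross`, and `vertical_cross` helpers from the earlier sketches; the annealing schedule, population replacement, and archive truncation are simplified placeholders rather than the exact Algorithm 1.

```python
# High-level sketch (my paraphrase) of the MOECSO loop for one rolling origin.
import numpy as np

def moecso(base_preds, y_val, M, pop=120, gens=150, seed=42):
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(M), size=pop)                    # feasible initial population
    archive = []                                               # external Pareto archive
    for g in range(gens):
        tau = g / max(gens - 1, 1)                             # progress in [0, 1]
        phi, sigma = 0.5 * tau, 0.1 * (1.0 - tau)              # assumed annealing schedule
        elite = archive[rng.integers(len(archive))] if archive else P[rng.integers(pop)]
        offspring = []
        for w in P:
            peer = P[rng.integers(pop)]
            offspring.append(horizontal_cross(w, peer, elite, phi, sigma, rng))
            offspring.append(vertical_cross(w, elite, phi, sigma, rng))
        scored = [(c, bi_objective(c, base_preds, y_val)) for c in offspring]
        # Archive update: keep the non-dominated candidates; the hybrid HV/ideal-distance
        # scoring and capacity truncation of Section 3.4.3 are omitted in this sketch.
        pool = scored + [(a, bi_objective(a, base_preds, y_val)) for a in archive]
        archive = [c for c, f in pool
                   if not any(g2[0] <= f[0] and g2[1] <= f[1] and g2 != f for _, g2 in pool)]
        P = np.array([c for c, _ in scored[:pop]])             # simplified replacement
    return archive                                             # approximate Pareto set of weights
```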
4. Experimental Setup
4.1. Data and Leakage-Free Evaluation
Daily Europe Brent spot prices in USD are taken from the U.S. Energy Information Administration table Europe Brent Spot Price FOB [29] and span 20 May 1987 to 8 December 2025. Trading-day observations are used without imputation. The task is univariate one-step-ahead forecasting.
A leakage-free rolling-origin protocol is used [1]. After a burn-in period, each origin uses a fixed training window of length $W$ ending at the origin and produces a one-step-ahead forecast of the next trading-day price. The last $V$ points in the window are used for validation, and the rest for fit. Z-score parameters (mean and standard deviation) are derived from the fit segment only and are used to standardize both the fit and validation segments, and forecasts are inverse-transformed to USD for error evaluation.
Ensembling uses simplex weights:
$$\hat{y}_{t+1}(\mathbf{w}) = \sum_{m=1}^{M} w_m \, \hat{y}_{t+1}^{(m)}, \qquad \mathbf{w} \in \Delta^{M-1}.$$
At each origin, $\mathbf{w}$ is optimized on the validation fold by MOECSO with a population of 120 and 150 iterations to minimize MAE and SSDVR in USD; the base learners are then refitted on the full window, and the out-of-sample forecast uses the selected weights. No future observations enter scaling, fitting, or weight search.
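The protocol can be sketched as follows; `fit_and_forecast` and `select_weights` are placeholder callables standing in for base-model fitting and the MOECSO weight search, and are not part of the paper's code.

```python
# Sketch of the leakage-free rolling-origin loop: scaling statistics and all
# fitting use only the window ending at the origin, and the one-step-ahead
# forecast is evaluated strictly out of sample in USD.
import numpy as np

def rolling_origin(y, W, V, fit_and_forecast, select_weights):
    preds, actuals = [], []
    for o in range(W, len(y)):
        window = y[o - W:o]                         # fixed-length window ending at the origin
        fit, val = window[:-V], window[-V:]         # fit segment and inner validation fold
        mu, sd = fit.mean(), fit.std()              # z-score stats from the fit segment only
        val_preds, test_preds = fit_and_forecast((window - mu) / sd, V)
        val_preds = val_preds * sd + mu             # back to USD before the weight search
        test_preds = test_preds * sd + mu
        w = select_weights(val_preds, val)          # MOECSO on (MAE, SSDVR) over the fold
        preds.append(float(test_preds @ w))         # simplex-weighted one-step forecast
        actuals.append(y[o])                        # realized price, never used upstream
    return np.array(preds), np.array(actuals)
```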
Fixed configurations include ARIMA [30], SVR [31], ELM [32], LSTM [33], and BPNN [34]. The random seed equals 42 unless stated otherwise. Calendar-period statistics in Table 1 support regime description only. Stress robustness restricts test origins to the defined stress windows under the same protocol and window length $W$, and metrics aggregate the corresponding out-of-sample points, with dispersion measured by the standard deviation.
4.2. Evaluation Metrics and Statistical Tests
A leakage-free rolling-origin protocol evaluates out-of-sample accuracy on the original USD scale. With $y_t$ and $\hat{y}_t$ denoting the realized and predicted Brent prices, the error is $e_t = y_t - \hat{y}_t$. Accuracy metrics are
$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |e_t|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^{2}}, \qquad \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|e_t|}{\max(|y_t|, \epsilon)},$$
where $n$ is the total number of test points and $\epsilon$ avoids division by zero.
Significance is assessed by loss-based tests on squared errors. For baseline $b$ and proposed method $p$, define $\ell^{b}_t = (e^{b}_t)^2$, $\ell^{p}_t = (e^{p}_t)^2$, and the loss differential $d_t = \ell^{p}_t - \ell^{b}_t$. The Diebold–Mariano statistic [35] is
$$\mathrm{DM} = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{d} / n}},$$
where $\bar{d}$ is the sample mean of $d_t$ and $\hat{\sigma}^2_{d}$ is the Newey–West HAC long-run variance with Bartlett weights [36]. Negative DM implies lower mean squared error for the proposed method.
To correct for data-snooping across multiple competitors, Hansen's SPA test [37] is applied with MOECSO as the reference (model 0). For competitor $k$, let $d_{k,t}$ denote the loss differential between the reference and competitor $k$, with sample mean $\bar{d}_k$. The studentized statistic is
$$t_k = \frac{\sqrt{n}\, \bar{d}_k}{\hat{\omega}_k},$$
where $\hat{\omega}_k$ is estimated by the same Newey–West rule, and $p$-values are obtained from a circular block bootstrap.
SPA $p$-values are adjusted by the Benjamini–Hochberg procedure [38]: for ordered $p$-values $p_{(1)} \le \dots \le p_{(K)}$, the adjusted values are
$$\tilde{p}_{(i)} = \min\!\Big(1, \; \min_{j \ge i} \frac{K}{j}\, p_{(j)}\Big),$$
and significance is declared at 5% when $\tilde{p}_{(i)} \le 0.05$.
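For reference, a minimal sketch of the DM statistic with a Bartlett-kernel long-run variance and of the Benjamini–Hochberg adjustment is given below; the bandwidth rule of thumb is an assumption, not taken from the paper.

```python
# Sketch of the Diebold-Mariano statistic with a Newey-West (Bartlett) long-run
# variance, and Benjamini-Hochberg adjustment of a vector of p-values.
import numpy as np

def dm_stat(loss_baseline, loss_proposed, bandwidth=None):
    d = loss_proposed - loss_baseline               # negative mean favors the proposed model
    n = d.size
    L = bandwidth if bandwidth is not None else int(np.floor(4 * (n / 100) ** (2 / 9)))
    d_c = d - d.mean()
    lrv = d_c @ d_c / n                             # lag-0 variance
    for lag in range(1, L + 1):
        w = 1.0 - lag / (L + 1)                     # Bartlett kernel weight
        lrv += 2.0 * w * (d_c[lag:] @ d_c[:-lag]) / n
    return d.mean() / np.sqrt(lrv / n)

def benjamini_hochberg(pvals, alpha=0.05):
    p = np.asarray(pvals)
    order = np.argsort(p)
    ranks = np.arange(1, p.size + 1)
    adj = np.minimum.accumulate((p[order] * p.size / ranks)[::-1])[::-1]
    out = np.empty_like(adj)
    out[order] = np.minimum(adj, 1.0)
    return out, out <= alpha                        # adjusted p-values and 5% decisions
```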
4.3. Computational Complexity and Scalability
Let $W$ denote the rolling window length, $V$ the inner validation length, $K$ the number of VMD modes, $M$ the number of base learners, $N$ the population size, and $G$ the number of MOECSO iterations. The per-origin runtime is approximated by
$$C_{\text{origin}} \approx C_{\text{VMD}} + C_{\text{feat}} + C_{\text{base}} + C_{\text{search}}.$$
With FFT-based implementations, $C_{\text{VMD}} = O(KW \log W)$ up to an inner-iteration constant. Feature extraction across all modes costs $C_{\text{feat}} = O(KW)$ for the correlation and energy indicators, whereas MIC is asymptotically more expensive but empirically tractable at this window length. Base-learner training contributes $C_{\text{base}}$, the sum of the per-model fitting costs. The multi-objective weight search evaluates simplex-weighted combinations over $V$ points and $M$ learners with cost $O(VM)$ per evaluation, so
$$C_{\text{search}} = O(NGVM).$$
Overall, runtime grows approximately linearly with $M$, $N$, $G$, and $V$, and grows close to $O(W \log W)$ in the window length due to decomposition and indicator extraction. Inference is a lightweight simplex aggregation.
To complement the asymptotic complexity, Table 2 reports an empirical benchmark under the same rolling-origin protocol. The wall-clock time is measured per origin and decomposed into decomposition, fitting, and weight search. Decomposition dominates the total runtime, especially for CEEMDAN, while the MOECSO search adds a small overhead under the current budget. A lightweight ridge-projection baseline provides negligible optimization time but remains less accurate than MOECSO. A small-scale scalability check shows an increase in total runtime with larger $W$ and larger $M$, which aligns with the $O(KW \log W)$ and $O(NGVM)$ terms.
4.4. Competitive Baselines and Implementation Details
It follows from Table 3 that the competing baselines exhibit a clear trade-off between training cost and online inference latency. ARIMA has the smallest size proxy; its per-origin training time and per-step inference latency are reported in Table 3, as are those of SVR. Among the neural forecasters, LSTM is the most computationally demanding in terms of parameter count, per-origin training time, and per-step inference latency, and BPNN also incurs nontrivial costs on all three measures.
The proposed MOECSO pipeline concentrates its computational burden in the offline stage. Its end-to-end per-origin training time is substantial because it integrates VMD-based reconstruction, refitting of all base learners, and weight optimization at each rolling origin. This added complexity does not translate into online latency: once the base forecasts are available, MOECSO introduces only a small additional per-step overhead for simplex-weighted aggregation, which is an order of magnitude smaller than the inference time of LSTM and remains substantially lower than that of SVR. These results indicate that MOECSO is well-suited to deployment settings that allow heavier offline updates while requiring fast one-step forecasts in real time.
4.5. Robustness Under Market Stress
Table 4 and Figure 5 jointly indicate that the proposed MOECSO ensemble remains operationally robust under pronounced market stress while preserving stable tracking of the price trajectory. Stress-period evaluation follows the same leakage-free rolling-origin protocol as the full-sample experiment. At each origin, preprocessing, model refitting, and simplex-weight optimization use only the current training window, and the one-step-ahead forecast is recorded as a strictly out-of-sample prediction [39]. A consistent per-origin budget is enforced across methods: LSTM and BPNN are trained for 25 epochs at each origin, and simplex weights are optimized at every origin using MOECSO with a population size of 120 and 150 iterations.
Across all three crisis episodes, the ensemble delivers consistent one-step-ahead out-of-sample forecasts under the same fixed-length rolling-origin setting. During the 2008–2009 Global Financial Crisis segment, the average errors remain moderate in terms of both MAE and Root Mean Square Error (RMSE). The tail-focused statistic MAE95 shows that extreme deviations are contained relative to the magnitude of the regime shift. This behavior is also visible in Figure 5, where the ensemble forecast follows the overall decline and subsequent stabilization without persistent drift, and the largest discrepancies concentrate around abrupt transitions rather than accumulating over time.
The stress results further highlight the practical advantage of MOECSO in adapting to different disruption patterns without retuning the model family. The COVID-19 shock yields the largest dispersion, as reflected by its RMSE and MAE95 values, consistent with the sudden collapse and rapid rebound in 2020; the forecast path still captures directional movements and avoids long-lasting bias. The 2022 energy-price spike exhibits a higher average MAE together with a smaller MAPE, suggesting that the ensemble maintains relative accuracy at elevated price levels. Overall, the combination of stable trajectory tracking in Figure 5 and controlled tail errors in Table 4 supports the view that MOECSO provides a resilient fusion mechanism. By optimizing simplex weights at each rolling origin, MOECSO can shift emphasis among the base learners as market conditions change, thereby reducing sustained mis-calibration and limiting worst-case deviations during crisis-driven regime shifts.
6. Results and Discussion
Table 8 and Figure 6 report full-sample out-of-sample performance under the leakage-free rolling-origin protocol. MOECSO remains close to the best single baseline in MAE and RMSE while outperforming ARIMA and BPNN. ARIMA's errors are consistent with its limited flexibility under nonlinear and regime-varying dynamics, and BPNN's results suggest that standalone neural fitting without adaptive fusion is insufficient in this setting. LSTM attains the lowest single-model MAE and RMSE, while the numerical gap relative to MOECSO remains small, and the DM, SPA, and FDR results do not support a decisive statistical advantage under the rolling-origin protocol. MOECSO optimizes simplex-constrained fusion weights under a joint accuracy and stability objective, so a slight increase in point error can occur when lower dispersion is prioritized across rolling origins. LSTM is a single learner that can be numerically favored when the local dynamics match its inductive bias, whereas MOECSO remains preferable in practice because it reduces reliance on a single model class under nonstationary price dynamics [40].
Figure 6 shows close trajectory tracking and no persistent drift on the representative test segment.
The rolling-origin evaluation produces a simplex weight vector $\mathbf{w}_o$ at each origin, so the ensemble weights form a time-indexed outcome that reflects how forecast reliance shifts across market conditions. Weight distributions across calendar regimes and crisis-defined stress sets characterize the location and dispersion of $\mathbf{w}_o$, offering an economic interpretation for decision-making under distinct market environments. To summarize whether the ensemble behaves as a concentrated selector or a diversified combiner, a concentration index over the weights is computed at each origin. A larger index indicates weight concentration on fewer models, while a smaller index indicates diversified averaging, linking the selected weights to an interpretable accuracy-versus-stability trade-off. Because the objectives penalize MAE and SSDVR jointly, regime and crisis shifts that change the feasible MAE-SSDVR trade-off induce systematic changes in $\mathbf{w}_o$ and in the concentration index, so the weights carry statistical and economic implications rather than acting as fixed coefficients. The stress-period evidence in Table 4 supports this interpretation, since elevated dispersion and tail deviations in the 2020 interval align with stronger emphasis on stability, which is consistent with shifts in optimized weight profiles across origins.
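A Herfindahl-type index (the sum of squared weights) is one standard choice for such a concentration measure and is assumed in the sketch below; the paper's own definition is the one given by its concentration-index equation.

```python
# Hedged illustration of a concentration index over simplex weights,
# assuming a Herfindahl-type sum of squared weights.
import numpy as np

def concentration(w):
    w = np.asarray(w)
    return float(np.sum(w ** 2))    # 1/M for uniform weights, 1 for a single-model selector

# Example: a diversified combiner vs. a concentrated selector over five models.
print(concentration([0.2, 0.2, 0.2, 0.2, 0.2]))      # 0.2
print(concentration([0.9, 0.05, 0.03, 0.01, 0.01]))  # ~0.81
```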
7. Conclusions
This paper presents a leakage-aware decomposition-ensemble forecasting framework for Brent crude oil prices, wherein the final prediction is derived as a convex combination of heterogeneous base forecasters, constrained by simplex conditions. The framework consists of two main components. First, an entropy-frequency VMD reconstruction operator transforms multi-mode decompositions into a single task-adaptive input signal. Second, a bi-objective ensemble weight search is addressed using the multi-objective evolutionary algorithm MOECSO. This optimization operates on a probability simplex and simultaneously aims to achieve complementary validation objectives that balance mean accuracy with residual stability. To enhance search reliability, two algorithmic mechanisms are integrated. An elite-guided horizontal and vertical crossover facilitates efficient exploration, while a hybrid archive update ensures stable maintenance of a Pareto set.
The proposed method is evaluated using a leakage-free rolling-origin protocol, with performance measured by standard error metrics. Inference is supported by the DM and SPA tests, incorporating FDR control. In addition to full-sample comparisons, robustness is assessed on stress-period subsamples corresponding to the 2008–2009 Global Financial Crisis, the 2020 COVID-19 pandemic, and the 2022 energy shock. The analysis reports both mean error criteria and the tail-focused diagnostic MAE95. Results from the stress periods indicate that forecast errors cluster around regime shifts and volatility bursts, suggesting that robustness should be evaluated in terms of both average accuracy and tail risk.
Several limitations indicate directions for future research. First, while VMD-based reconstruction enhances robustness, the reconstruction operator could benefit from incorporating uncertainty-aware weighting or regime-conditioned scoring. Second, the current ensemble is based on fixed model families. Integrating modern sequence and mixer architectures and assessing their marginal benefits under the same leakage-free protocol would enhance generalizability. Third, weight optimization could explicitly focus on transaction-relevant objectives, such as directional accuracy during periods of high volatility, drawdown-aware loss, or asymmetric penalties, and could be extended to generate calibrated predictive intervals. Finally, repeated refitting remains computationally intensive. Implementing warm-starting, incremental updates, or surrogate-assisted optimization could decrease runtime while maintaining statistical rigor.