3.1. Experimental Environment and Data
The experiment was conducted on a Windows 11 64-bit operating system, utilizing an AMD Ryzen 7 8845H processor with Radeon 780M Graphics at 3.80 GHz and 16 GB of RAM. Python 3.11 was used as the programming language, and Matplotlib 3.6.3 was employed for data visualization.
This study uses the closing prices of metal futures as the primary dataset, with data sourced from https://cn.investing.com (accessed on 18 March 2025) and shfe.com.cn (accessed on 15 March 2025). To assess the forecasting capabilities of the BH-TCN model across varying market environments and degrees of price volatility, gold futures data from both the Shanghai Futures Exchange (SHFE) and the New York Commodity Exchange (COMEX) were chosen as the focus of this study. The analysis considers the most liquid contracts traded on each exchange. The empirical dataset consists of daily closing prices across all continuous trading days from 2014 to 2024, as shown in Figure 4. In this plot, the x-axis corresponds to the trading day index, while the y-axis shows the associated closing price.
For sample division, 80% of the data are allocated to training, with the remaining 20% reserved for testing. All subsequent model analysis and evaluation are based on this division. The original price series is first normalized and scaled to the [0, 1] range. Normalization facilitates faster updates of neural network parameters, thereby accelerating model convergence and enhancing both training efficiency and predictive performance. The normalization procedure is described by the following equation:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ represents the original data, $x_{\min}$ and $x_{\max}$ denote the minimum and maximum of the series, and $x'$ denotes the normalized value.
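As a concrete illustration, the scaling step can be sketched as follows. The input file name is hypothetical, and fitting the scaler on the training split only is a common convention assumed here rather than stated in the text.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical input: a plain-text file with one closing price per line.
prices = np.loadtxt("shfe_au_close.csv").reshape(-1, 1)
split = int(len(prices) * 0.8)                 # 80% train / 20% test

scaler = MinMaxScaler(feature_range=(0, 1))    # x' = (x - min) / (max - min)
train_scaled = scaler.fit_transform(prices[:split])   # fit on the training data only
test_scaled = scaler.transform(prices[split:])

# Forecasts produced in [0, 1] space are mapped back to price units with
# scaler.inverse_transform(...) before error metrics are computed.
```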
This study performs a descriptive statistical examination of the closing prices for SHFE gold (SHFE Au) and COMEX gold (COMEX Au) futures (see Table 1). The SHFE gold futures exhibit relatively greater price volatility, as indicated by a higher standard deviation. Both datasets display positive skewness, suggesting that extreme price observations above the mean occur more frequently. In terms of kurtosis, SHFE gold futures show a light-tailed distribution, whereas COMEX gold futures exhibit a heavy-tailed distribution, implying a higher likelihood of extreme fluctuations in the latter. The Jarque–Bera (J-B) test yields statistically significant results at the 1% level for both time series, rejecting the null hypothesis of normality and indicating that the price distributions are non-Gaussian. The Ljung–Box Q(10) test reveals significant autocorrelation at the 10th lag for both markets, suggesting the presence of long-range dependence. In summary, the closing prices of SHFE and COMEX gold futures exhibit non-normality, right skewness, light or heavy tails, and significant autocorrelation, providing a statistical foundation for subsequent time series modeling.
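The statistics reported in Table 1 can be reproduced along the lines of the sketch below; the input file name is hypothetical, and note that pandas reports excess kurtosis, which may differ from the convention used in the table.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

prices = pd.Series(np.loadtxt("shfe_au_close.csv"))     # hypothetical file

summary = {
    "mean": prices.mean(),
    "std": prices.std(),
    "skewness": prices.skew(),
    "excess_kurtosis": prices.kurt(),
}
jb_stat, jb_pvalue = stats.jarque_bera(prices)           # normality test
ljung_box = acorr_ljungbox(prices, lags=[10])            # Ljung-Box Q(10) test

print(summary)
print(f"Jarque-Bera: stat={jb_stat:.2f}, p-value={jb_pvalue:.4f}")
print(ljung_box)                                         # lb_stat, lb_pvalue columns
```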
To evaluate whether the closing price series of SHFE and COMEX gold futures are stationary, the Augmented Dickey–Fuller (ADF) unit root test is applied. The p-values obtained for the SHFE and COMEX series are 0.9987 and 0.9949, respectively, both far exceeding the conventional significance levels of 1%, 5%, and 10% (refer to Table 2). In addition, the corresponding ADF test statistics, 2.0484 and 1.0648, are greater than the critical values at all standard significance levels. These results imply that the null hypothesis of a unit root cannot be rejected, indicating that the price series of both SHFE and COMEX gold futures are non-stationary.
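A minimal sketch of the ADF test with statsmodels is shown below; the file names are hypothetical placeholders for the two closing-price series.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Hypothetical inputs: 1-D arrays of daily closing prices for each market.
series_map = {
    "SHFE Au": np.loadtxt("shfe_au_close.csv"),
    "COMEX Au": np.loadtxt("comex_au_close.csv"),
}
for name, series in series_map.items():
    stat, pvalue, usedlag, nobs, crit_values, icbest = adfuller(series)
    print(f"{name}: ADF statistic = {stat:.4f}, p-value = {pvalue:.4f}")
    print(f"  critical values: {crit_values}")
    # p-values near 1 (as reported above) mean the unit-root null cannot be
    # rejected, so the raw price series are treated as non-stationary.
```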
3.2. Decomposition Results
The original time series is decomposed using the BH algorithm, with its parameters optimized via a two-stage grid search to achieve a more reasonable number of decompositions and improved decomposition performance. The number of control points is chosen to strike an optimal balance between overfitting and underfitting within the model. In the first (coarse) stage, the parameter is searched over [0.1, 0.5] with a step size of 0.1, and the best performance is obtained at the boundary value of 0.1. A second (refined) search is therefore conducted over [0.01, 0.1] with a step size of 0.01, which yields the final optimal value of 0.01. Model selection is performed by minimizing the MSE on the dataset. The optimal decomposition results for the metal futures prices are shown in Figure 5. Effective decomposition of the original dataset facilitates the extraction of meaningful components, which in turn reduces the complexity of the data. The resulting sub-series exhibit clear linearity, trend patterns, and low complexity, providing suitable inputs for subsequent TCN-based forecasting. Meanwhile, the final residual sequence approximates white noise, indicating that the decomposition process has effectively removed noise components and enhanced the overall accuracy and stability of the modeling.
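Because the BH decomposition itself is the authors' method, the following is only a generic coarse-to-fine grid-search skeleton: `fit_fn` stands in for the BH decomposition and must return a reconstruction of the series, and the MSE against the original series serves as the selection criterion.

```python
import numpy as np

def two_stage_grid_search(series, fit_fn):
    """Coarse-to-fine search for a single scalar decomposition parameter.

    fit_fn(series, param) is a placeholder for the BH decomposition; selection
    minimizes the MSE between the series and its reconstruction.
    """
    def best_over(grid):
        scores = {p: float(np.mean((series - fit_fn(series, p)) ** 2)) for p in grid}
        return min(scores, key=scores.get)

    coarse_best = best_over(np.round(np.arange(0.1, 0.51, 0.1), 2))
    # The coarse optimum falls on the lower boundary (0.1), so refine below it.
    refined_best = best_over(np.round(np.arange(0.01, 0.11, 0.01), 2))
    return coarse_best, refined_best
```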
3.3. Hyperparameter Selection
In many predictive models, both the input window length and the number of training epochs play crucial roles in determining the model’s performance [42]. In this context, the term “window” denotes the length of the time steps in each input data sequence, whereas “training epoch” refers to the total number of times the model processes the entire training dataset. Although properly adjusting the window size and the number of epochs can enhance prediction accuracy, an excessive number of parameters does not necessarily yield better results; instead, it may cause the model to transition rapidly from underfitting to overfitting. To select hyperparameters reasonably, this study employs a grid search method for model tuning [43]. This approach begins by specifying a range of possible values for each hyperparameter, which are then combined to create multiple candidate configurations. The model is trained under each configuration, and its effectiveness is assessed on the validation dataset. Ultimately, the optimal parameter combination is chosen as the final configuration for the model. This procedure significantly improves both the stability and accuracy of the model’s predictions. Table 3 presents the hyperparameter selections for the various models.
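A minimal sketch of such a grid search over window length and training epochs is shown below; `build_and_eval` is a placeholder for model training and validation, and the candidate grids are illustrative rather than the values reported in Table 3.

```python
import itertools
import numpy as np

def grid_search_window_epochs(train, valid, build_and_eval,
                              windows=(10, 20, 30), epochs=(50, 100, 200)):
    """Exhaustive search over input-window length and number of training epochs.

    build_and_eval(train, valid, window, n_epochs) is a placeholder that trains
    a model with the given settings and returns its validation error.
    """
    best_config, best_err = None, np.inf
    for window, n_epochs in itertools.product(windows, epochs):
        err = build_and_eval(train, valid, window, n_epochs)
        if err < best_err:
            best_config, best_err = (window, n_epochs), err
    return best_config, best_err
```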
To guarantee a fair and consistent comparison of performance, the hyperparameters of both the proposed BH-TCN model and the benchmark methods are appropriately configured and optimized. Bézier curve fitting and wavelet denoising are introduced as preprocessing steps to construct the Bézier-TCN and wavelet denoising-TCN (WD-TCN) models, respectively, aiming to denoise the original time series and enhance feature representation. Specifically, Bézier fitting follows the piecewise strategy implemented by Zhao [32], applying a uniform 10-segment, fifth-order Bézier curve fit to smooth out noise components in the raw data. For wavelet denoising, the optimal wavelet basis function and decomposition level are selected based on the principles of maximum signal-to-noise ratio (SNR) and minimum MSE, effectively capturing the primary features of the signal.
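The wavelet-denoising benchmark can be sketched with PyWavelets as follows. The candidate bases and levels, as well as the soft universal-threshold rule, are assumptions; the text only states that the basis and level are chosen to maximize SNR and minimize MSE.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet, level):
    """Soft-threshold denoising with the universal threshold (assumed rule)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745          # noise-level estimate
    thr = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

def select_wavelet_and_level(signal, wavelets=("db4", "sym8", "coif5"), levels=(2, 3, 4)):
    """Choose the (basis, level) pair with the highest SNR and report its MSE."""
    best, best_snr = None, -np.inf
    for wavelet in wavelets:
        for level in levels:
            rec = wavelet_denoise(signal, wavelet, level)
            mse = float(np.mean((signal - rec) ** 2))
            snr = 10 * np.log10(np.sum(signal ** 2) / (np.sum((signal - rec) ** 2) + 1e-12))
            if snr > best_snr:
                best, best_snr = (wavelet, level, mse), snr
    return best, best_snr
```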
For conventional time series models and machine learning models, the parameter selection methods are as follows.
In the ARIMA model, the autoregressive order p, differencing order d, and moving average order q are selected by minimizing the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) within an appropriate search range, thereby balancing model fit and complexity.
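An illustrative order search with statsmodels is sketched below; the (p, d, q) ranges are hypothetical, and BIC can be substituted for, or combined with, AIC in the same loop.

```python
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_arima_order(train, p_range=range(0, 4), d_range=range(0, 3), q_range=range(0, 4)):
    """Return the (p, d, q) order with the lowest AIC over the given grid."""
    best_order, best_aic = None, np.inf
    for p, d, q in itertools.product(p_range, d_range, q_range):
        try:
            result = ARIMA(train, order=(p, d, q)).fit()
        except Exception:
            continue                      # skip combinations that fail to converge
        if result.aic < best_aic:
            best_order, best_aic = (p, d, q), result.aic
    return best_order, best_aic
```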
For the SVR model, the penalty parameter C is determined through grid search over the candidate values [0.1, 1, 10], while the width of the insensitive zone ε is optimized within the interval [0.01, 0.1]. The radial basis function (RBF) is employed as the kernel function.
Similarly, the XGBoost model’s hyperparameters are tuned using a grid search strategy. The number of weak learners is searched within the range [500, 1000], the maximum tree depth is selected from [6, 10, 15], the learning rate is selected from [0.001, 0.005, 0.01], and both the subsample ratio and column sampling ratio are optimized within the range [0.8, 1.0].
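Both searches can be expressed compactly with scikit-learn's GridSearchCV, as in the sketch below. The use of TimeSeriesSplit for validation and the mapping of the searched quantities onto standard parameter names (C, epsilon, max_depth, learning_rate, and so on) are assumptions following common practice rather than details given in the text.

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR
from xgboost import XGBRegressor

cv = TimeSeriesSplit(n_splits=5)          # time-ordered validation folds (assumption)

svr_search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "epsilon": [0.01, 0.1]},
    cv=cv,
    scoring="neg_mean_squared_error",
)

xgb_search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid={
        "n_estimators": [500, 1000],
        "max_depth": [6, 10, 15],
        "learning_rate": [0.001, 0.005, 0.01],
        "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8, 1.0],
    },
    cv=cv,
    scoring="neg_mean_squared_error",
)

# Usage: svr_search.fit(X_train, y_train); xgb_search.fit(X_train, y_train)
# The tuned estimators are then available as *_search.best_estimator_.
```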
3.4. Predictive Performance
Initially, the RMSE, MAE, and MAPE metrics are computed for each model to assess the extent of prediction errors (refer to Table 4). Figure 6 provides a visual comparison of these error metrics across all models.
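For reference, the three error metrics can be computed as in the following sketch; MAPE is expressed in percent, which is assumed to match the convention used in Table 4.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)   # in percent
```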
As demonstrated by the experimental results in Table 4, the BH-TCN model consistently outperforms the other methods, delivering superior predictive accuracy on both the SHFE Au and COMEX Au datasets. On the SHFE Au dataset, the RMSE, MAE, and MAPE values are 5.4694, 3.9083, and 0.7644, respectively, markedly outperforming the other benchmark models. On the COMEX Au dataset, the BH-TCN model similarly exhibits superior performance, with RMSE, MAE, and MAPE values of 21.4865, 16.6375, and 0.7708, respectively, demonstrating the model’s strong generalization capability. In summary, the BH-TCN model outperforms ARIMA, SVR, XGBoost, LSTM, and TCN on the RMSE, MAE, and MAPE metrics. Moreover, compared with the other improved TCN-based models (Bézier-TCN and WD-TCN), BH-TCN also exhibits a significant advantage.
Figure 6 presents histograms of the main models’ RMSE, MAE, and MAPE values, providing a visual comparison. The histogram bars of the BH-TCN model are the lowest across all metrics, indicating that it achieves the best results with minimal prediction errors. This confirms the effectiveness and superior predictive capability of the BH-TCN model for metal futures prices. The coefficient of determination (R²) of the BH-TCN model is 0.990741 on SHFE Au and 0.994525 on COMEX Au, both outperforming the other comparative models and demonstrating stronger fitting ability (see Table 5). Additionally, the model obtained values of 0.542056 and 0.532710 on the second metric reported in Table 5, ranking at a relatively high level among all methods and indicating greater stability in prediction results. In contrast, models such as Bézier-TCN, WD-TCN, standard LSTM, and TCN exhibited slightly inferior performance on both metrics, suggesting potential instability when handling complex prediction scenarios. Conventional time series and machine learning models performed significantly worse than the deep learning and hybrid models on both datasets, with limited fitting ability and lower stability. In comparison, the BH-TCN model more effectively captures the complex dynamic changes in the time series, maintaining stable and low prediction errors and thereby exhibiting superior cross-market generalization capability.
To further confirm the superior forecasting capability of the BH-TCN model relative to alternative approaches, this study employs the modified Diebold–Mariano (MDM) test with MSE, MAE, and MAPE loss functions. The corresponding outcomes are presented in Table 6. According to the MDM test results, all competing models reject the null hypothesis at the 1% significance level when compared against the BH-TCN model, demonstrating a statistically significant difference in predictive accuracy between the BH-TCN model and its counterparts. In other words, the BH-TCN model consistently delivers more precise and dependable predictions than the other evaluated models. This result further corroborates, from a statistical validation perspective, the outstanding performance and robustness of the BH-TCN model in commodity price forecasting tasks.
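A textbook implementation of the MDM statistic for one-step-ahead forecasts (the small-sample correction of the Diebold–Mariano test due to Harvey, Leybourne, and Newbold) is sketched below; this is not the authors' exact code, and the MAPE-based variant only requires replacing the loss differential accordingly.

```python
import numpy as np
from scipy import stats

def mdm_test(e1, e2, h=1, loss="mse"):
    """Modified Diebold-Mariano test for equal predictive accuracy.

    e1, e2 are forecast-error series from two competing models; returns the
    MDM statistic and its two-sided p-value from the t(T-1) distribution.
    """
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    if loss == "mse":
        d = e1 ** 2 - e2 ** 2
    elif loss == "mae":
        d = np.abs(e1) - np.abs(e2)
    else:
        raise ValueError("unsupported loss")
    T = len(d)
    d_bar = d.mean()
    # Long-run variance of the loss differential: autocovariances up to lag h-1.
    gamma = [np.sum((d[k:] - d_bar) * (d[: T - k] - d_bar)) / T for k in range(h)]
    var_d_bar = (gamma[0] + 2 * sum(gamma[1:])) / T
    dm = d_bar / np.sqrt(var_d_bar)
    mdm = dm * np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)   # small-sample correction
    p_value = 2 * stats.t.sf(np.abs(mdm), df=T - 1)
    return mdm, p_value
```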
The results indicate that BH-TCN delivers the best overall performance across all datasets and evaluation metrics. Except for one value, which is slightly lower than that of the WD-TCN model, all other metrics achieve optimal results. The BH-TCN model not only outperforms the other comparative methods in fitting accuracy but also maintains a leading position in prediction stability, reflecting its robustness. Furthermore, the model exhibits excellent generalization ability under various market conditions, effectively adapting to diverse data structures and fluctuation patterns.
The model’s prediction outcomes are depicted in Figure 7. Specifically, Figure 7a illustrates the forecasted trend of gold futures prices on the SHFE test set drawn from the 2014–2024 sample, while Figure 7b illustrates the predicted trend on the COMEX test set over the same period. In both figures, the x-axis corresponds to the index of data points in the test set, while the y-axis denotes the closing price. The mazarine (dark blue) line represents the predicted values from the BH-TCN model, while the crimson line shows the actual values. It is evident that the BH-TCN model’s predictions outperform those of the other models, whose forecasts exhibit larger deviations from the actual values. Notably, during periods of sharp price fluctuations, the discrepancy between predicted and actual values is more pronounced for the other models, while the BH-TCN model consistently approximates the actual values with greater precision.
The superior performance of the BH-TCN model in capturing sudden price changes can be attributed to its architectural design and multi-scale decomposition strategy. Specifically, the Bézier–Hurst (BH) decomposition isolates trend components at different scales, effectively filtering noise while preserving abrupt local variations. When these multi-scale sub-sequences are fed into the TCN, the dilated causal convolution layers expand the receptive field without increasing computational cost, allowing the model to leverage long-range dependencies and detect sudden shifts simultaneously. Additionally, the hybrid structure of the BH-TCN enhances its adaptability to nonlinear and volatile market dynamics. Consequently, the model not only achieves lower overall prediction errors but also demonstrates increased robustness and sensitivity during periods of sharp price fluctuations, outperforming standard TCN, LSTM, WD-TCN, and Bézier–TCN approaches.
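The receptive-field argument can be made concrete with a short calculation; the kernel size and dilation schedule below are illustrative, not the settings reported in Table 3.

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field (in time steps) of stacked dilated causal convolutions,
    with convs_per_block convolutions per residual block as in a typical TCN."""
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

# Example: kernel size 3 with dilations (1, 2, 4, 8) already looks back over
# 1 + 2 * (3 - 1) * 15 = 61 time steps without adding parameters per layer.
print(tcn_receptive_field(3, (1, 2, 4, 8)))
```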
3.5. Robustness Check
To further assess the robustness of the proposed model across different training–test split ratios and varying sizes of prediction datasets, this study performs additional empirical analyses. Initially, the first 90% of the original sequence is allocated as the training set, with the remaining 10% reserved for testing. The MDM test outcomes for this setup are provided in Table 7, Panel A. Subsequently, the dataset is partitioned such that the first 70% constitutes the training set, while the final 30% is used for testing. The MDM test results corresponding to this configuration are displayed in Table 7, Panel B.
As shown in Table 7, Panel A, under the 90%–10% training–test split, the null hypothesis is rejected at the 1% significance level in 39 instances and at the 5% level in 3 instances. These results demonstrate that the BH-TCN model significantly outperforms the majority of the comparative models.
Similarly, in Table 7, Panel B (70%–30% training–test split), the null hypothesis is rejected at the 1% significance level in 38 cases, providing additional evidence that the BH-TCN model retains robust predictive performance despite a reduced training set size.
In conclusion, validation under different training–test split ratios and varying prediction sample sizes shows that the BH-TCN model delivers notable stability, high adaptability, and superior predictive accuracy, exhibiting clear advantages over the other models.