1. Introduction
This study investigates the forecasting performance of the seasonal autoregressive integrated moving average (SARIMA), generalised autoregressive conditionally heteroscedastic (GARCH), general regression neural network (GRNN) and artificial neural network (ANN)-based extreme learning machine (ELM) models for the South African gold sales series. According to [1], the gold mining industry in South Africa has long been a cornerstone of the nation’s economy, historically serving as a primary driver of economic growth, employment, and export revenue. Nevertheless, the industry is characterised by intrinsic volatility, shaped by a complex interaction of elements such as changing global commodity prices, increasing extraction depths, escalating operational expenses, and shifting socio-political environments [2]. Thus, precise predictions of gold sales are crucial for strategic planning, investment choices, and policy development to maintain the sector’s stability and its ongoing contribution to the national economy [3].
Traditional statistical models, such as SARIMA and GARCH, have been widely used for time-series forecasting in commodity markets [4,5]. While SARIMA effectively captures linear structures and seasonal patterns, GARCH models are proficient at modelling time-varying conditional variance, with constant parameters, and the volatility clustering that often exists in economic data [6]. GARCH models are computationally efficient, easy to estimate, and widely understood, making them particularly suitable for policy-oriented and applied forecasting studies where transparency and interpretability are essential. In contrast, the generalised autoregressive score (GAS) model proposed by [7] relies on the specification of a full conditional density and score-driven dynamics, which can increase model complexity and estimation uncertainty, especially in relatively low-frequency monthly data. However, a significant limitation of these classical techniques is their presumed linearity, which may render them insufficient for capturing the complex, nonlinear relationships inherent in gold market dynamics [8].
In recent years, sophisticated machine learning (ML) techniques have emerged as powerful alternatives, demonstrating remarkable success in modelling nonlinear and complex systems [9]. Techniques such as GRNN and ANN-based ELM offer significant advantages [10,11]. These models can autonomously learn complex patterns and nonlinear dependencies from historical data without requiring pre-specified relationships, potentially leading to more accurate and reliable forecasts in financial and resource economics [12].
Against this backdrop, this study presents a comparative investigation into the forecasting performance of a variety of models for the South African gold sales series. Specifically, this study examines and contrasts the predictive accuracy of the traditional SARIMA and GARCH models with the advanced ML capabilities of the GRNN and ANN-based ELM. By training and rigorously evaluating these models on historical data, this study aims to determine the most reliable and effective methodology for forecasting future trends in this critical sector. This study is among the first to apply and compare these advanced neural network architectures alongside traditional time-series models for forecasting South African gold sales, thereby contributing valuable insights for stakeholders in the mining industry and financial markets.
The rest of the study is structured as follows: Section 2 presents the literature review, Section 3 discusses the methodology, Section 4 discusses the findings, and Section 5 presents the conclusion and recommendations.
3. Methodology
This section describes the dataset, preprocessing steps, and methodological framework employed in this study.
3.1. Data Preparation and Preprocessing
3.1.1. Data
The study employed monthly gold sales time-series data ranging from January 2003 to July 2024, with 259 observations sourced from Statistics South Africa (StatsSA). The data are publicly available and were accessed on 25 February 2025 at https://www.statssa.gov.za (accessed on 12 March 2026): Table P2041: Mining Production and Sales, series MVK24000: Mineral sales according to mining divisions, mineral groups and minerals, Gold.
SARIMA, GRNN, and ANN-based ELM have been used individually in different studies to model the linear and nonlinear characteristics of time-series data. The use of GRNN and ANN-based ELM is justified over other ML and deep learning models for several theoretical and practical reasons. First, GRNN and ANN-based ELM are well suited for small-to-medium-sized datasets, such as monthly South African gold sales data. Second, GRNN and ANN-based ELM possess fast training speeds and simpler network architectures, reducing computational cost and the risk of overfitting compared to deep learning models with multiple hidden layers and numerous hyperparameters. Lastly, these models are effective at capturing nonlinear relationships and complex patterns in time-series data without the need for extensive parameter tuning or long training horizons. The use of the SARIMA model is justified when compared with other seasonal time-series models due to its flexibility, statistical robustness, and strong forecasting performance in modelling stochastic seasonal processes [4]. Furthermore, SARIMA remains widely recognised as a benchmark seasonal forecasting model due to its parsimony, interpretability, and strong statistical foundation. This study therefore employs both linear and nonlinear models and compares their performance in modelling the gold sales time-series data.
3.1.2. Data Split and Normalisation
The dataset was split into two subsets, namely, 80% for training and 20% for testing. This train–test split was chosen because it strikes a reasonable balance between model training and out-of-sample evaluation. Allocating 80% to training ensures that there is enough historical data to reliably capture the underlying patterns, seasonality, and volatility in the gold sales series. At the same time, reserving 20% for testing provides a sufficiently large hold-out sample to assess the models’ predictive performance on unseen data [28,29].
The target variable, monthly gold sales, was normalised using Min–Max scaling in the range [0, 1] for ML models. To ensure stationarity, the series was differenced before fitting the SARIMA and GARCH models. For comparability, the same differenced series was used to train the neural network models. Additionally, normalisation was applied only to the neural network models, as ML algorithms are sensitive to feature scales, whereas SARIMA and GARCH models are not.
Table 1 summarises the data handling and preprocessing for each model.
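The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the synthetic `sales` series, the helper names, the choice to split the differenced series without shuffling, and the choice to learn the Min–Max bounds from the training block only are all assumptions.

```python
def difference(series, lag=1):
    """Return the lag-differenced series y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def chronological_split(series, train_fraction=0.8):
    """Split without shuffling so the test block follows the training block in time."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def fit_min_max(train):
    """Learn the scaling bounds from the training block only (assumption)."""
    return min(train), max(train)


def apply_min_max(values, lo, hi):
    """Map values onto [0, 1] relative to the training bounds."""
    span = hi - lo
    return [(v - lo) / span for v in values]


# Hypothetical stand-in for the 259 monthly observations (not the StatsSA data).
sales = [1500.0 + 10.0 * t + 200.0 * ((t % 12) - 5.5) for t in range(259)]

diffed = difference(sales)                  # first differences of the series
train, test = chronological_split(diffed)   # 80% training block, 20% test block
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)   # real test data may fall outside [0, 1]
```

Fitting the bounds on the training block alone keeps information from the hold-out period out of the scaled inputs; if the bounds were instead taken from the full series, the test block would always lie inside [0, 1].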
3.1.3. Model Architecture
The GRNN was implemented as a non-parametric kernel-based neural network for time-series prediction. GRNN estimates the conditional expectation of the target variable using a Gaussian kernel function based on the distance between training observations and test observations. Given an input vector, x, the GRNN prediction is computed as a weighted average of observed outputs, where the weights are determined by the Gaussian spread parameter (σ). In this study, the spread parameter was set to 0.1 to control the smoothness of the regression function. Since GRNN relies on distance calculations, the input variables were normalised prior to model training to improve numerical stability and ensure comparable feature magnitudes.
The ANN-ELM model was implemented as a hybrid two-stage neural architecture combining ANN and an ELM. First, an ANN with two hidden layers (10 and 5 neurons) was trained using backpropagation to capture nonlinear patterns, and its predictions were used as inputs to the ELM to enhance predictive performance through model stacking. The ELM randomly initialised hidden-layer weights and biases, computed hidden-layer outputs using a ReLU activation function, and analytically estimated output weights using Ridge regression, resulting in faster training and improved numerical stability compared to traditional neural networks.
3.1.4. Software
The analysis was conducted using Python version 2022.3.3 software, and the details of the models are discussed in the following subsections.
3.2. Seasonal Autoregressive Integrated Moving Average (SARIMA) Model
For the past three decades, ARIMA models have been widely used in numerous fields for time-series forecasting. Introduced by [4] in the early 1970s, the ARIMA model has become a proven and reliable method for predicting time-series data [4]. The general form of ARIMA$(p, d, q)$ is given as follows:
$$\phi_p(B)\,\nabla^{d} y_t = \theta_q(B)\,\varepsilon_t,$$
where $y_t$ is the observed series and $\varepsilon_t$ is a white-noise error term; $\phi_p(B)$ and $\theta_q(B)$ are polynomials in $B$ of degrees $p$ and $q$, respectively; $\nabla = 1 - B$; and $B$ represents the backward shift operator. The seasonal ARIMA$(p, d, q)(P, D, Q)_s$ is an extension of the ARIMA model which aims to improve the performance of the ARIMA model in modelling and predicting time series with seasonal effects. It is multiplicative in nature. The mathematical representation of SARIMA is given as follows:
$$\phi_p(B)\,\Phi_P(B^{s})\,\nabla^{d}\,\nabla_s^{D} y_t = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t.$$
The SARIMA model is a widely used time-series forecasting method that consists of seven key parameters, namely, $p$: the number of autoregressive (AR) terms; $d$: the degree of differencing required to make the series stationary; $q$: the number of moving average (MA) terms; $P$: the seasonal AR lags; $D$: the degree of seasonal differencing; $Q$: the seasonal MA lags; and $s$: the length of the seasonal cycle. $\Phi_P(B^{s})$ and $\Theta_Q(B^{s})$ are seasonal AR and MA polynomials of orders $P$ and $Q$, respectively. $\nabla^{d}$ and $\nabla_s^{D}$, with $\nabla_s = 1 - B^{s}$, denote the nonseasonal and seasonal differencing operators, respectively.
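To make the multiplicative structure concrete, the following worked expansion uses a hypothetical SARIMA$(0, 1, 1)(0, 1, 1)_{12}$ specification. It is not the model selected in this study, and the sign convention for the MA polynomials ($1 + \theta_1 B$ and $1 + \Theta_1 B^{12}$) is an assumption, since some texts write them with minus signs.

```latex
% Hypothetical SARIMA(0,1,1)(0,1,1)_{12}, monthly data (s = 12)
(1 - B)(1 - B^{12})\, y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})\, \varepsilon_t

% Left-hand side: (1 - B)(1 - B^{12}) = 1 - B - B^{12} + B^{13}
y_t - y_{t-1} - y_{t-12} + y_{t-13}
  = \varepsilon_t + \theta_1 \varepsilon_{t-1}
  + \Theta_1 \varepsilon_{t-12} + \theta_1 \Theta_1 \varepsilon_{t-13}

% Solved for the current observation
y_t = y_{t-1} + y_{t-12} - y_{t-13}
  + \varepsilon_t + \theta_1 \varepsilon_{t-1}
  + \Theta_1 \varepsilon_{t-12} + \theta_1 \Theta_1 \varepsilon_{t-13}
```

The cross term $\theta_1 \Theta_1 \varepsilon_{t-13}$ arises only because the seasonal and nonseasonal polynomials are multiplied, which is what "multiplicative in nature" refers to.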
The Box–Jenkins approach used in this study consists of three iterative stages: model identification, parameter estimation and diagnostic testing.
- I.
Model identification
The initial step in the Box–Jenkins methodology is model identification, which focuses on assessing whether the time series is stationary and determining the necessary level of differencing if required. Stationarity is a key assumption for ARIMA/SARIMA models, as failing to achieve it can result in inaccurate forecasts. A time series is deemed stationary when its statistical characteristics, including mean and variance, remain consistent over time [4]. To evaluate stationarity, methods such as the visual inspection of time-series plots, statistical tests such as the Augmented Dickey–Fuller (ADF) test pioneered by [30] and outlined by [13], and autocorrelation function (ACF) plots are frequently employed. If the time series exhibits non-stationary behaviour, techniques like differencing or logarithmic transformations can be used to stabilise its statistical properties.
Ref. [30] proposed the Augmented Dickey–Fuller (ADF) test as a formal approach to detect the presence of a unit root in a time series. Subsequently, in 1992, Kwiatkowski, Phillips, Schmidt, and Shin introduced the KPSS test as an alternative or complementary method to the ADF test, offering a different perspective by taking level or trend stationarity as the null hypothesis [31]. This study employs visual plots as well as both the ADF and KPSS formal tests to assess stationarity in the time-series data. Furthermore, the ACF and the partial autocorrelation function (PACF) are analysed to determine possible AR and MA components, aiding in the selection of appropriate model parameters.
- II.
Parameter estimation
After identifying the appropriate model structure, the next phase is estimation, which involves determining the parameters of the ARIMA model. This is commonly achieved through methods such as Maximum Likelihood Estimation (MLE), which identifies the parameter values that maximise the likelihood of the observed data. Once the parameters are estimated, model selection is conducted using statistical criteria such as the AIC and BIC. These metrics assist in comparing models by weighing the trade-off between goodness of fit and complexity, with lower AIC or BIC values indicating a more suitable model. Choosing the best model at this stage is essential for ensuring precise forecasts while preventing overfitting.
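The selection step can be illustrated with the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of estimated parameters, n the number of observations and ln L the maximised log-likelihood. The sketch below is only an illustration: the candidate labels, log-likelihood values and parameter counts are hypothetical and are not taken from the fitted models in this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates: (label, maximised log-likelihood, number of parameters).
candidates = [
    ("candidate_A", -20.0, 3),
    ("candidate_B", -18.5, 6),
    ("candidate_C", -18.8, 4),
]
n_obs = 206  # hypothetical training sample size

table = [(name, aic(ll, k), bic(ll, k, n_obs)) for name, ll, k in candidates]
best_by_aic = min(table, key=lambda row: row[1])[0]  # lowest AIC wins
best_by_bic = min(table, key=lambda row: row[2])[0]  # lowest BIC wins
```

In this made-up example the two criteria disagree: BIC penalises each extra parameter by ln(n) rather than 2, so it favours the smaller model, which is why it is worth stating which criterion drives the final choice.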
- III.
Diagnostic testing
Diagnostic testing in the Box–Jenkins methodology focuses on evaluating the statistical properties of the error terms, specifically the normality assumption and the weak white noise assumption. These assumptions ensure that the residuals (errors) from the fitted model do not exhibit any predictable patterns or structures. To assess the model’s overall validity, the Ljung–Box test proposed by [
32] is commonly used. This test examines whether the residuals are independently distributed, effectively checking for any remaining autocorrelation. If the Ljung–Box test indicates significant autocorrelation, it suggests that the model has not fully captured the underlying data patterns, and further adjustments to the model may be required. The hypotheses to be tested are as follows:
H0. The model is adequate.
H1. The model is inadequate.
The test statistic for the Ljung–Box test is given as follows:
$$Q = n'(n' + 2)\sum_{k=1}^{h}\frac{r_k^{2}}{n' - k},$$
where, in $n' = n - d$, $n$ denotes the number of observations and $d$ is the degree of nonseasonal differencing used to transform the original series into a stationary one. The $r_k^{2}$ denotes the square of the autocorrelation of the residuals at lag $k$, and $h$ is the number of lags tested [33].
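A minimal sketch of the Ljung–Box statistic in plain Python is given below. The function names and the toy residual series are hypothetical, and the length of the residual series passed in is taken as the effective sample size.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((r - mean) ** 2 for r in residuals)
    numer = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom


def ljung_box_q(residuals, max_lag):
    """Q = n (n + 2) * sum_{k=1..h} r_k^2 / (n - k), with h = max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Toy residuals with an obvious alternating pattern (strong lag-1 autocorrelation).
toy_residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
q_stat = ljung_box_q(toy_residuals, max_lag=1)
```

For the toy series the statistic is large relative to the 5% chi-square critical value for one degree of freedom (about 3.84), so the hypothesis of no residual autocorrelation would be rejected. In practice the degrees of freedom are usually reduced by the number of estimated ARMA parameters.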
3.3. Generalised Autoregressive Conditionally Heteroscedastic (GARCH) Model
In 1982, ref. [5] introduced autoregressive conditional heteroscedasticity (ARCH) models to account for the time-varying volatility frequently observed in economic and financial time-series data. Later, in 1986, ref. [6] expanded on this concept by developing generalised autoregressive conditional heteroscedasticity (GARCH) models, which efficiently model the dynamics of conditional heteroscedasticity as a variance process. In GARCH models, the squared volatility, $\sigma_t^{2}$, is influenced by both past squared volatilities and past squared values of the model. This characteristic makes them a generalised form of ARCH models. The GARCH$(p, q)$ process is given as follows:
$$\sigma_t^{2} = \omega + \sum_{i=1}^{q}\alpha_i\,\varepsilon_{t-i}^{2} + \sum_{j=1}^{p}\beta_j\,\sigma_{t-j}^{2},$$
where $\omega > 0$, $\alpha_i \ge 0$, $\beta_j \ge 0$, and $\sum_{i=1}^{q}\alpha_i + \sum_{j=1}^{p}\beta_j < 1$ to ensure that the conditional variance remains positive. The constraint on $\sum_{i=1}^{q}\alpha_i + \sum_{j=1}^{p}\beta_j$ also ensures that the unconditional variance in the $\varepsilon_t$ is finite while its conditional variance $\sigma_t^{2}$ changes over time [
34]. Suppose
from GARCH
then the random variable
has a
model if the following is true:
The parameters of a GARCH model can be estimated using the Maximum Likelihood Estimation (MLE) method, which optimises the likelihood function to find the values that best fit the observed data.
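The variance recursion can be sketched for the simplest case, a GARCH(1,1) model. The parameter values and the shock sequence below are hypothetical and chosen only to show how a large shock feeds into the next period's conditional variance; starting the recursion at the unconditional variance is also an assumption.

```python
def garch11_variance(shocks, omega, alpha, beta):
    """Conditional variance path: s2[t] = omega + alpha * e[t-1]^2 + beta * s2[t-1]."""
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("parameters violate positivity or finite-variance constraints")
    # Start the recursion at the unconditional variance omega / (1 - alpha - beta).
    sigma2 = [omega / (1.0 - alpha - beta)]
    for t in range(1, len(shocks)):
        sigma2.append(omega + alpha * shocks[t - 1] ** 2 + beta * sigma2[t - 1])
    return sigma2


# Hypothetical shocks with one large move at index 2.
shocks = [0.1, -0.2, 1.5, -0.3, 0.2, 0.1]
path = garch11_variance(shocks, omega=0.05, alpha=0.10, beta=0.85)
```

The variance decays towards its long-run level while shocks are small and jumps in the period after the large shock, which is the volatility-clustering behaviour the model is designed to capture.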
3.4. Generalised Regression Neural Network (GRNN)
The general regression neural network (GRNN) was first introduced by [
10] in 1991 and is widely recognised for its advantages as a meta-modelling algorithm. The GRNN model is based on nonlinear regression theory, which allows it to model complex, nonlinear relationships between inputs and outputs using a non-parametric approach. According to [
35], GRNN is based on non-parametric regression principles, relying on sampled data and employing Parzen non-parametric estimation to determine network output using the maximum probability principle. Additionally, unlike backpropagation-based methods, GRNN does not require an iterative training process. GRNN excels in nonlinear approximation.
According to [36], a GRNN consists of four layers: input, pattern, summation, and output. The input layer receives the data, with the number of input units equal to the number of observed parameters. The pattern layer stores training patterns, while the summation layer contains two types of neurons: a single division (D-summation) neuron and S-summation neurons, which connect the pattern layer to the output layer. The hidden and output layers utilise radial basis and linear activation functions, respectively. Each hidden neuron corresponds to one training pattern, which allows the network to perform non-parametric regression and estimate the probability density function of the underlying data. Learning in the GRNN is instantaneous. Finally, the output layer normalises the output by dividing the output of each S-summation neuron by the output of the D-summation neuron, producing the predicted value $\hat{y}(\mathbf{x})$ for the given unknown input vector $\mathbf{x}$, computed as follows:
$$\hat{y}(\mathbf{x}) = \frac{\sum_{i=1}^{n} y_i \exp\!\left(-\dfrac{D_i^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{n} \exp\!\left(-\dfrac{D_i^{2}}{2\sigma^{2}}\right)},$$
where
$$D_i^{2} = \sum_{j=1}^{m}\left(x_j - x_{ij}\right)^{2},$$
where $n$ represents the number of training patterns; $y_i$ represents the weighted connection between the $i$th pattern layer neuron and the S-summation neuron; the Gaussian function is denoted by $\exp\!\left(-D_i^{2}/(2\sigma^{2})\right)$; $m$ denotes the number of input vector elements; and $x_j$ and $x_{ij}$ are the $j$th element of $\mathbf{x}$ and $\mathbf{x}_i$, respectively. The optimal value of the spread parameter ($\sigma$) is determined experimentally. One of the key advantages of GRNN is its rapid learning capability and its ability to achieve an optimal regression surface as the sample size grows. This makes GRNN particularly useful in real-time applications with limited data, as it can quickly establish the regression surface even with a small number of samples [37]. Shapley additive explanations (SHAP) were employed to quantify the contribution of each lagged input to the model’s predictions. This approach enables interpretability of nonlinear temporal dependencies captured by the GRNN, identifying dominant short-term and seasonal memory effects. The structure of the GRNN architecture is visually represented in Figure 1, providing a schematic overview of its four key layers: input, pattern, summation, and output.
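The GRNN prediction rule, a Gaussian-weighted average of the training targets, can be sketched in a few lines of plain Python. The function name and the toy training set are hypothetical; the default spread of 0.1 follows the value reported in Section 3.1.3.

```python
import math


def grnn_predict(x, train_inputs, train_targets, sigma=0.1):
    """Kernel-weighted average of training targets (pattern, summation, output)."""
    weights = []
    for xi in train_inputs:
        # Pattern layer: squared Euclidean distance to each stored pattern.
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)  # D-summation neuron
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase sigma")
    weighted = sum(w * y for w, y in zip(weights, train_targets))  # S-summation
    return weighted / total  # output layer: S / D


# Hypothetical normalised training patterns with one input feature each.
train_inputs = [[0.0], [0.5], [1.0]]
train_targets = [0.0, 0.5, 1.0]

sharp = grnn_predict([0.0], train_inputs, train_targets, sigma=0.1)
smooth = grnn_predict([0.0], train_inputs, train_targets, sigma=10.0)
```

With a small spread the prediction stays close to the nearest stored target, whereas a very large spread pulls every prediction towards the overall mean of the targets. This is the smoothness trade-off that the spread parameter controls.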
3.5. Extreme Learning Machine (ELM)
An artificial neural network (ANN), inspired by the human nervous system, is a widely used tool in artificial intelligence, particularly for tasks such as prediction, pattern recognition, and classification [
38]. According to [
38], the performance of ANN-based techniques heavily depends on the careful tuning of key parameters, including the number of hidden layers, nodes, weights, and the choice of transfer function. However, extensive research and practical applications have revealed certain limitations of this approach [
39]. Ref. [
11] highlighted several drawbacks associated with traditional ANN methods, such as long computation times, difficulties in determining stopping criteria, challenges in managing the learning rate and epochs, susceptibility to local minima, and the need for extensive fine-tuning.
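To overcome these limitations, ref. [40] introduced a novel learning algorithm designed for single-hidden-layer feedforward networks (SLFNs), known as the extreme learning machine (ELM). In this approach, input weights and hidden biases are randomly assigned, while output weights are determined analytically using the Moore–Penrose (MP) generalised inverse method. In this study, each training sample consisted of the normalised gold sale together with its corresponding time index as the input feature vector, while the target variable was the normalised gold sale. Prior to model training, the data were normalised to ensure comparable feature magnitudes and improve numerical stability of the hidden layer activations. Given a training dataset of $N$ unique samples $(\mathbf{x}_i, \mathbf{t}_i)$, the output with zero error for the SLFN with $L$ hidden neurons can be expressed as follows:
$$\sum_{j=1}^{L}\boldsymbol{\beta}_j\, g\!\left(\mathbf{w}_j \cdot \mathbf{x}_i + b_j\right) = \mathbf{t}_i, \qquad i = 1, \dots, N, \tag{9}$$
where $g(\cdot)$ is the activation function; $\mathbf{w}_j$ is the input weights; $\boldsymbol{\beta}_j$ denotes the weights connecting the hidden-to-output layer; and $b_j$ is the biases in the hidden layer. The matrix representation of the $N$ equations in Equation (9) is given as follows:
$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T}, \tag{10}$$
where
$$\mathbf{H} = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1^{\top} \\ \vdots \\ \boldsymbol{\beta}_L^{\top} \end{bmatrix}, \qquad \mathbf{T} = \begin{bmatrix} \mathbf{t}_1^{\top} \\ \vdots \\ \mathbf{t}_N^{\top} \end{bmatrix}.$$
Since the weights $\mathbf{w}_j$ and biases $b_j$ are assigned randomly, the weight vector $\boldsymbol{\beta}$ is the only parameter that needs to be estimated. However, the structure of the hidden layer output matrix $\mathbf{H}$ depends on the data sample and $L$, meaning that Equation (10) may not always hold exactly. As a result, estimating $\boldsymbol{\beta}$ is essentially reformulated as a least squares optimisation problem, expressed in the following form:
$$\min_{\boldsymbol{\beta}} \left\lVert \mathbf{H}\boldsymbol{\beta} - \mathbf{T} \right\rVert.$$
Ref. [41] stated that, according to optimisation theory, the solution that minimises the objective function $\lVert \mathbf{H}\boldsymbol{\beta} - \mathbf{T} \rVert$ is given as follows:
$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T},$$
where $\mathbf{H}^{\dagger} = (\mathbf{H}^{\top}\mathbf{H})^{-1}\mathbf{H}^{\top}$ when $\mathbf{H}^{\top}\mathbf{H}$ is invertible, which is known as the MP generalised inverse (also called the pseudo-inverse) of $\mathbf{H}$. The key difference between ELM and traditional neural network approaches is that, in ELM, there is no need to fine-tune all the parameters of the feedforward network, such as the input weights and hidden layer biases [42]. The number of hidden neurons and activation/spread selection help control model complexity. SHAP was also computed to interpret the nonlinear forecasts by quantifying the contribution of each lagged input. This approach provides insight into the temporal memory structure learned by the ELM, distinguishing between short-term persistence and seasonal effects.
Figure 2 depicts the schematic structure of ELM.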
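The ELM training rule, random hidden-layer parameters followed by a pseudo-inverse solve for the output weights, can be sketched with NumPy as below. This is an illustrative sketch only: the function names, the toy sine-wave target, the number of hidden neurons and the random seed are assumptions, and the ReLU activation follows the description in Section 3.1.3.

```python
import numpy as np


def elm_fit(X, T, n_hidden, rng):
    """Draw random input weights and biases, then solve H beta = T by least squares."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never trained)
    b = rng.normal(size=n_hidden)                # random hidden biases (never trained)
    H = np.maximum(X @ W + b, 0.0)               # hidden-layer output matrix, ReLU
    beta = np.linalg.pinv(H) @ T                 # Moore-Penrose solution for beta
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Apply the fixed random hidden layer and the fitted output weights."""
    return np.maximum(X @ W + b, 0.0) @ beta


rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)   # hypothetical normalised input
T = np.sin(2.0 * np.pi * X[:, 0])              # hypothetical nonlinear target

W, b, beta = elm_fit(X, T, n_hidden=10, rng=rng)
fitted = elm_predict(X, W, b, beta)
rmse = float(np.sqrt(np.mean((fitted - T) ** 2)))
baseline = float(np.sqrt(np.mean(T ** 2)))     # error of predicting zero everywhere
```

Because only `beta` is estimated, training reduces to a single linear solve. A ridge-regularised variant, as mentioned in Section 3.1.3, would replace the pseudo-inverse with a solve of the form (HᵀH + λI)β = HᵀT, where λ is a penalty that would have to be chosen by the analyst.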
3.6. Evaluation of the Forecasting Performance of the Models
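In this study, evaluation metrics are employed to gauge the effectiveness of the proposed models. These metrics are the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), mean forecast error (MFE) and Theil’s U. The metrics are computed using the following equations, respectively:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^{2}},$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|,$$
$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|,$$
$$\mathrm{MFE} = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right),$$
$$U = \frac{\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^{2}}}{\sqrt{\frac{1}{n}\sum_{t=1}^{n} y_t^{2}} + \sqrt{\frac{1}{n}\sum_{t=1}^{n}\hat{y}_t^{2}}},$$
where $y_t$ denotes the actual value and $\hat{y}_t$ represents the predicted values of the gold sales while $n$ is the total number of observations. Furthermore, the Diebold–Mariano (DM) test developed by [43] was employed to statistically compare the forecast accuracy of traditional time-series forecasting models with the ML techniques employed in the study to test whether differences in their forecasting performance are statistically significant. The DM test statistic is computed using the following formula:
$$\mathrm{DM} = \frac{\bar{d}}{\sqrt{\dfrac{\hat{\gamma}_0 + 2\sum_{k=1}^{M}\hat{\gamma}_k}{T}}},$$
where $\bar{d}$ is the average loss differential between the two models; $\hat{\gamma}_k$ represents the estimated autocovariance of the loss differential at lag $k$; $M$ is the truncation lag (also called the bandwidth); and $T$ denotes the number of forecasts (sample size).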
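The accuracy measures and the DM statistic can be sketched in plain Python as follows. Two points are assumptions rather than details confirmed by the text: Theil's U is implemented in its bounded U1 form, and the DM loss differential uses squared-error loss. The toy numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mfe(actual, predicted):
    """Mean forecast error; negative values mean forecasts are too high on average."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


def theil_u1(actual, predicted):
    """Theil's U1 (assumed variant), bounded between 0 and 1."""
    n = len(actual)
    scale = math.sqrt(sum(a * a for a in actual) / n) + math.sqrt(
        sum(p * p for p in predicted) / n
    )
    return rmse(actual, predicted) / scale


def dm_statistic(errors_a, errors_b, max_lag=0):
    """DM statistic for squared-error loss (assumed) with truncation lag max_lag."""
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    T = len(d)
    d_bar = sum(d) / T

    def gamma(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, T)) / T

    long_run = gamma(0) + 2.0 * sum(gamma(k) for k in range(1, max_lag + 1))
    if long_run <= 0.0:
        raise ValueError("non-positive long-run variance estimate")
    return d_bar / math.sqrt(long_run / T)


# Hypothetical actual values and forecasts.
actual = [2.0, 4.0, 5.0, 4.0]
predicted = [3.0, 4.0, 4.0, 6.0]
errors_a = [a - p for a, p in zip(actual, predicted)]
errors_b = [0.5, -0.5, 0.5, -1.0]  # hypothetical errors of a competing model
dm = dm_statistic(errors_a, errors_b)
```

A positive DM value here means the first model has the larger average squared error. Under the null hypothesis of equal accuracy the statistic is compared with standard normal critical values.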
4. Discussion of Findings
This section presents an analysis of the study’s findings, with the results illustrated through tables and figures.
4.1. Exploratory Data Analysis (EDA)
The EDA was performed to understand the characteristics of the dataset. The results are presented in
Table 2.
The gold sales dataset consists of 259 observations with a mean value of 5602.078 and a median value of 5114.900, which indicates that the data is slightly skewed to the right, as the mean is greater than the median. The minimum value observed is 1477.10, while the maximum reaches as high as 20,492.500, which reflects a wide range of values. The standard deviation of 2997.16, along with the high variance of 8,982,938.18, indicates significant variability in the dataset, suggesting that the data points are widely spread out around the mean.
Figure 3 provides a visual representation of the gold sales.
As shown in Figure 3, the gold sales series fluctuates noticeably over the sample period, and visual inspection suggests that it is non-stationary. To confirm this, formal stationarity tests were conducted, with the results detailed in Table 3.
The results in Table 3 demonstrate that the p-value of the ADF test at level is 0.985, which is greater than the 0.05 significance level, so the null hypothesis of a unit root cannot be rejected, indicating non-stationarity. Similarly, the KPSS test at level yielded a p-value of 0.010, which leads to rejection of its null hypothesis of stationarity and therefore also suggests non-stationarity. However, after first differencing, both tests confirmed stationarity. The ADF test reported a p-value of 0.000, while the KPSS test yielded a p-value of 0.100, which is above the 0.05 threshold, indicating that the study failed to reject the null hypothesis of stationarity. Overall, these results confirm that the time series is integrated of order 1, I(1).
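Because the ADF and KPSS tests have opposite null hypotheses, their p-values are read in opposite directions. The decision logic used above can be sketched as a small helper; the function name and the "inconclusive" label are assumptions, while the p-values are those reported in Table 3.

```python
def joint_stationarity_verdict(adf_p, kpss_p, alpha=0.05):
    """Combine ADF (null: unit root) and KPSS (null: stationarity) p-values."""
    adf_rejects_unit_root = adf_p < alpha
    kpss_rejects_stationarity = kpss_p < alpha
    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "stationary"
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return "non-stationary"
    return "inconclusive"  # the two tests disagree


level = joint_stationarity_verdict(adf_p=0.985, kpss_p=0.010)
first_difference = joint_stationarity_verdict(adf_p=0.000, kpss_p=0.100)
```

Both tests point the same way at each stage, non-stationary at level and stationary after first differencing, which is what supports the I(1) conclusion.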
4.2. Results of the SARIMA Model
In this section, the ACF and PACF are used to identify the appropriate order for the time-series model. The ACF and PACF plots are presented in
Figure 4.
According to Figure 4, four competing SARIMA specifications were identified. Using the AIC, one of these was deemed most suitable for gold sales. The parameter estimates for this model are summarised in Table 4.
The results in Table 4 reveal that all the parameters in the model are statistically significant, as all their p-values are below the threshold of 0.05. The MA(1) term, with a coefficient of 0.797, has a strong impact on the model, highlighting the role of past errors at lag 1. The model also reveals that seasonal effects are important, with seasonal AR(12) and seasonal AR(24) capturing yearly and two-year seasonal patterns, with coefficients of 1.546 and 0.637, respectively. The seasonal MA(12) and seasonal MA(24) terms, with coefficients of 1.515 and 0.698, indicate a strong positive effect on the model from errors at lags 12 and 24. Finally, the estimated error variance (Sigma2) of 0.0671 is also statistically significant. Overall, the results suggest that the SARIMA model successfully incorporates both autocorrelation and seasonality, making it effective for capturing complex temporal patterns in the data. The results of the diagnostic tests for the fitted SARIMA model are summarised in Table 5.
The results summarised in Table 5 from the JB test show that the residuals are not normally distributed, as the p-value is less than 0.05. Additionally, the Ljung–Box Q test yields a p-value of 0.590, which exceeds the 0.05 significance level, so the null hypothesis of model adequacy is not rejected and there is no evidence of remaining autocorrelation in the residuals of the SARIMA model.
Figure 5 displays the observed versus the fitted values of the SARIMA model. As evident from the graph, the SARIMA model captures the general trend and direction of returns, but it smooths out extreme spikes and sharp fluctuations, underestimating some of the higher volatility periods. The close alignment in most periods indicates that the model fits reasonably well in stable market conditions, though it struggles to fully replicate extreme market movements. The log returns also exhibit mean-reverting behaviour at around zero, which is a typical characteristic of financial and commodity return series. Furthermore, from around 2020 onwards, the series shows more pronounced volatility and extreme fluctuations, possibly reflecting market disruptions associated with the COVID-19 pandemic and related economic uncertainty. This increase in volatility may also explain the larger prediction deviations observed during this period, as sudden market shocks are generally more difficult for SARIMA models to capture. The results of the GARCH model are presented in Section 4.3.
4.3. Results of the GARCH Model
Table 6 presents the estimation results of the GARCH model, which was performed using the differenced data.
The results in Table 6 revealed that the mean (µ) estimate of -0.017 is not statistically significant with a p-value of 0.160, suggesting the average return is not meaningfully different from zero. Similarly, the omega (ω) parameter, which represents the constant in the volatility equation, is also not significant (p = 0.340), indicating that the base level of volatility is small and not statistically different from zero. In contrast, gamma1 ($\gamma_1$), with an estimate of 0.355 and a p-value of 0.031, is significant, suggesting that past shocks or asymmetries have an important effect on current volatility. Lastly, beta1 ($\beta_1$) has a strong and highly significant estimate of 0.823, indicating that past volatility carries over and has a major influence on current volatility, highlighting persistence in the volatility process.
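As a quick arithmetic check on persistence, the two reported volatility coefficients can be summed and compared with the bound discussed in Section 3.3. The interpretation below rests on an assumption: that gamma1 is the coefficient on the lagged squared shock in a GARCH(1,1) variance equation. If gamma1 is instead an asymmetry term, the relevant persistence measure would be computed differently.

```python
# Estimates reported in Table 6.
gamma1 = 0.355  # assumed here to be the coefficient on the lagged squared shock
beta1 = 0.823   # coefficient on the lagged conditional variance

persistence = gamma1 + beta1
exceeds_unit_bound = persistence >= 1.0
```

Under that assumption the sum is about 1.178, which lies above the bound of one under which the unconditional variance is finite. The estimated volatility process would therefore be very persistent, and its long-run variance would need to be interpreted with caution.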
4.4. SHAP Analysis Results of GRNN Model and ANN-ELM Model
To gain deeper insights into the predictive behaviour of the neural networks, SHAP analysis was performed for both the GRNN and ANN-ELM models. To visually interpret the contributions of individual lagged features to model predictions, a GRNN SHAP value diagram was generated to illustrate the magnitude and direction of each feature’s impact on the predicted gold sales values.
According to Figure 6, the SHAP results indicate that lag 1 is the most dominant feature, implying that the most recent observation plays a crucial role in driving the model’s predictions. This suggests that the GRNN model is strongly driven by short-term temporal dependence, with recent gold sales observations containing the most relevant predictive information for the model. Moreover, lags 6 and 7 show moderate contributions, reflecting the presence of some medium-term influence in the predictive structure. This may reflect delayed responses in the gold sales series, where the effects of economic or market shocks may take several months to fully influence observed sales patterns. The SHAP distribution further shows that higher values of lag 1 tend to increase the predicted outcome, while lag 7 exhibits a more mixed or opposite effect, suggesting possible delayed adjustment effects in the series. In contrast, lags 11, 8, 3, 5 and 12 exhibit very small SHAP values, indicating minimal long-term or seasonal dependence within the model. This confirms that the model relies predominantly on short-term memory effects, with limited capacity to capture longer-term seasonal patterns. This partially explains the weaker performance of the GRNN model observed in Table 9, as the model appears to underutilise longer-term structural information that may be important for capturing seasonal dynamics in gold sales. For decision-makers, this suggests that recent trends are the most important for short-term prediction, while older observations provide supporting information for more stable forecasting. The SHAP values are presented in Table 7.
Table 7 presents the mean absolute SHAP values, summarising the overall contribution of each lag to the model’s output. The results confirm the relative importance of the lags observed in the SHAP value diagram. To visually interpret the contributions of individual lagged features to model predictions, Figure 7 presents the ANN-ELM SHAP value diagram generated to illustrate the magnitude and direction of each feature’s impact on the predicted gold sales values.
According to
Figure 7, the SHAP results indicate that lag 12 is the most influential feature, with the highest mean absolute SHAP value of 0.019797. This reveals that the value from 12 periods prior plays a significant role in shaping current predictions, pointing to strong seasonal or long-term dependency in the model. The SHAP distribution further shows that higher values of lag 12 tend to push the predictions upward, confirming its strong positive contribution to predictability, while lower values tend to reduce the predicted outcome. Additionally, lags 7, 6, and 9 have comparatively substantial contributions, suggesting that the model’s predictions are significantly impacted by medium-term historical values. The spread of SHAP values for these lags also indicates that they provide complementary predictive information by capturing delayed temporal effects. This implies that while making predictions, the model takes into account intermediate past data in addition to seasonality. In contrast, more recent lags such as lag 1 and lag 2 have smaller contributions compared to lag 12 and the mid-range lags, implying that short-term fluctuations are less dominant in the predictive process. Furthermore, lags 3, 5, and 11 exhibit minimal SHAP values, indicating that these specific past observations contribute very little to the model’s output. Overall, the diagram highlights that the model captures strong seasonal and medium-term dynamics, rather than relying on short-term memory. From a decision-making perspective, this suggests that historical seasonal patterns are the most informative for forecasting, while very recent changes have limited influence. Overall, the SHAP analysis confirms that ANN-ELM emphasises medium- and long-term dynamics, providing an interpretable understanding of how seasonal and cyclical effects drive gold sales forecasts. The SHAP values are presented in
Table 8.
Table 8 provides the mean absolute SHAP values, which quantify the overall contribution of each lag to the model’s predictions. The ranking of lags demonstrates that lags 12 and 7 exert the greatest influence, while the remaining predictors have comparatively smaller effects. The performance of the GARCH model is compared with that of the other models, namely SARIMA, GRNN and ANN-ELM, in Section 4.5.
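As an illustration of how a mean absolute SHAP ranking of this kind is obtained, the following minimal sketch averages the absolute SHAP values of each lag over all observations and sorts the lags by that average. The function name `mean_abs_shap` and the toy SHAP values are hypothetical and are not taken from the study.

```python
from statistics import fmean


def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value.

    shap_values: list of rows, one per observation; each row holds one
    SHAP value per feature, in the same order as feature_names.
    """
    scores = {
        name: fmean(abs(row[j]) for row in shap_values)
        for j, name in enumerate(feature_names)
    }
    # Largest mean |SHAP| first, i.e. most influential feature first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values for three observations and three lags
toy_shap = [
    [0.010, -0.002, 0.030],
    [-0.020, 0.001, -0.010],
    [0.000, 0.003, 0.020],
]
ranking = mean_abs_shap(toy_shap, ["lag 1", "lag 2", "lag 12"])
```

Because the absolute value is taken before averaging, a lag that pushes predictions strongly in both directions still ranks highly, even though its signed contributions would partly cancel.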
4.5. Comparison of SARIMA, GARCH, GRNN and ANN-Based ELM Models
To assess the forecasting performance of the best SARIMA, GARCH, GRNN and ANN-based ELM models, RMSE, MAE, MAPE, MFE and Theil’s U were computed for the pre-COVID-19 period, the COVID-19-and-beyond period, and the overall sample. The results are summarised in Table 9.
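The five accuracy measures can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's own code: the forecast error is taken as actual minus forecast, so that a negative MFE corresponds to overestimation (the sign convention used in the discussion of Table 9), and Theil's U is computed in its U1 form, which is one of several definitions in use. The function name `forecast_errors` is hypothetical.

```python
import math


def forecast_errors(actual, forecast):
    """Return RMSE, MAE, MAPE (%), MFE and Theil's U1 for two equal-length series."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    n = len(actual)
    # Assumed convention: error = actual - forecast (negative => overestimate)
    errors = [a - f for a, f in zip(actual, forecast)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an actual value is exactly zero
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mfe = sum(errors) / n
    # Theil's U1 (assumed variant): RMSE scaled by the root mean squares
    # of the actual and forecast series; it lies between 0 and 1
    scale = math.sqrt(sum(a * a for a in actual) / n) + math.sqrt(
        sum(f * f for f in forecast) / n
    )
    theil_u = rmse / scale
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "MFE": mfe, "TheilU": theil_u}


# Hypothetical values for illustration only
metrics = forecast_errors([1.0, 2.0, 4.0], [2.0, 2.0, 3.0])
```

Since MAPE divides each error by the actual value, it can become large when actual values are close to zero, which is common for differenced series; this is worth bearing in mind when comparing MAPE with RMSE and MAE.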
According to the results presented in Table 9, the performance comparison of the models in the pre-COVID-19 regime reveals that the ANN-ELM model outperformed all other models on all the error metrics, implying that it captured the pre-COVID-19 patterns very well. This conclusion is supported by its lowest RMSE of 0.157 and MAE of 0.112, which indicate smaller forecast deviations compared to the other competing models. Additionally, all the models in this regime presented negative MFE values, indicating that the models overestimated the actual gold sales on average and that there is a systematic upward bias in the forecasts. SARIMA is the second-best performing model of the four, and the GRNN model performed worst, as it struggled to capture the pre-COVID-19 data patterns.
In the COVID-19 and beyond regime, all errors increased compared to the pre-COVID-19 regime, likely due to market volatility and structural changes. However, the ANN-ELM model again performed best, showing robustness even under extreme volatility. In the same regime, SARIMA’s performance dropped significantly, with RMSE increasing from 0.202 to 0.424 and MAE increasing from 0.141 to 0.347. Similar increases were observed in MAPE and Theil’s U, indicating a general deterioration in forecast accuracy under heightened volatility. Again, all models overestimated gold sales across the regime, as shown by the negative MFE values. GARCH and GRNN showed similarly large errors, highlighting the difficulty of capturing extreme events and nonlinearities during the period.
Over the full dataset, the results revealed that the SARIMA model achieved the best results, with the lowest RMSE of 0.260, MAE of 0.184 and MFE of -0.016. This indicates its effectiveness in capturing the underlying patterns and seasonality in the data. The ANN-based ELM model followed as the second-best performer, outperforming both the GRNN and GARCH models with a MAPE value of 45.68% and Theil’s U of 0.241. The GARCH model performed worse than both the SARIMA and ANN-based ELM models but better than the GRNN model, which demonstrated the weakest performance over the full sample, indicating its relative unsuitability for this dataset. Although ANN-ELM achieved the lowest MAPE and Theil’s U, indicating better relative percentage accuracy and benchmark performance, SARIMA maintained superior performance in terms of absolute error minimisation, as reflected by RMSE and MAE. This suggests that while ANN-ELM provides competitive performance, SARIMA provides more stable overall forecast accuracy across the full sample period.
Overall, the SARIMA model was selected as the best performing model due to its ability to effectively capture seasonal and trend patterns in the overall sample. It demonstrated consistency and reliability across the full dataset, providing stable and interpretable forecasts that balance accuracy and practical decision-making. The results highlight the strength of traditional statistical approaches such as the specified SARIMA model in this context, while also acknowledging the potential of advanced ML models such as ANN-based ELM. Therefore, it is concluded that the selected traditional model performed effectively in modelling the South African gold sales data. This contradicts the study by [17], which found the opposite to be true. The study in [23] also revealed that the ANN outperformed the traditional ARIMA, which contradicts the findings of the current study. These differences may be attributed to variations in datasets, volatility conditions, model tuning procedures and evaluation frameworks.
4.6. DM Test Results for Forecast Comparison
Table 10 presents the results of the DM test for the overall sample.
The results in Table 10 suggest that the difference in forecast accuracy between the SARIMA and ANN-ELM models is not statistically significant at the 5% level. This implies that although the SARIMA model exhibits marginally lower forecast errors, its forecast accuracy is not statistically distinguishable from that of ANN-ELM. The DM test therefore does not contradict the selection of the SARIMA model for the prediction of gold sales, but it indicates that its advantage over ANN-ELM should be interpreted with caution.
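For readers unfamiliar with the DM test, the following minimal sketch shows one common form of the statistic. The loss function and forecast horizon used in the study are not stated in this section, so the sketch assumes squared-error loss, a one-step horizon by default, and the asymptotic standard normal approximation for the p-value; the function name `dm_test` and the toy data are hypothetical.

```python
import math


def dm_test(actual, forecast_a, forecast_b, h=1):
    """Diebold-Mariano test under squared-error loss (assumed loss function).

    Returns (statistic, two-sided p-value). A negative statistic means
    forecast_a has the lower average loss.
    """
    n = len(actual)
    # Loss differential: squared error of A minus squared error of B
    d = [
        (y - fa) ** 2 - (y - fb) ** 2
        for y, fa, fb in zip(actual, forecast_a, forecast_b)
    ]
    d_bar = sum(d) / n

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    # Long-run variance with autocovariances up to lag h - 1
    # (for h = 1 this is simply the variance of the loss differential)
    lrv = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    if lrv <= 0:
        raise ValueError("non-positive long-run variance estimate")
    stat = d_bar / math.sqrt(lrv / n)
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Hypothetical data: forecast A has uniformly smaller errors than forecast B
y = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
f_a = [v - e for v, e in zip(y, [0.1, -0.1, 0.1, -0.1, 0.1, -0.1])]
f_b = [v - e for v, e in zip(y, [0.2, -0.3, 0.2, -0.3, 0.2, -0.3])]
stat_ab, p_ab = dm_test(y, f_a, f_b)
```

A p-value above 0.05 means the null hypothesis of equal predictive accuracy cannot be rejected at the 5% level; it does not by itself establish that either model is better.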
4.7. One Step Ahead Forecast
Table 11 presents a two-year forecast using SARIMA as the best performing model.
Table 11 presents the two-year SARIMA forecasts for the log-differenced monthly gold sales. Positive forecast values indicate expected month-to-month increases in gold sales, whereas negative values signal anticipated declines. The forecasts reveal a clear pattern of alternating gains and losses, reflecting the inherently volatile behaviour of gold sales. In the short-term horizon (the first six months of the forecast), the model predicts several negative movements interspersed with modest positive corrections. This pattern suggests temporary downward pressure on gold sales, accompanied by short-lived recoveries rather than sustained growth.
During the medium-term period (six to twelve months ahead), gold sales are forecast to continue exhibiting volatility, with fluctuations between positive and negative changes. However, the magnitude of these movements appears more moderate than in the short-term period, indicating a gradual reduction in extreme adjustments. In the long-term horizon (twelve to twenty-four months ahead), the forecasts point to modest but more consistent positive values, suggesting an emerging upward momentum. Although occasional negative shocks remain present, the overall pattern implies a degree of stabilisation and a slow increase in gold sales over time. Overall, the SARIMA model effectively captures the volatile yet mean-reverting nature of gold sales dynamics. The results indicate that while short-term fluctuations and uncertainty are likely to persist, the longer-term outlook suggests a gradual recovery and potential stabilisation of gold sales toward the end of the forecast horizon.
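Because the forecasts are expressed as log differences, they can be translated back into sales levels by cumulating them from the last observed value. The sketch below assumes the modelled series is the first difference of the natural logarithm of sales, with no additional transformation applied outside the model; the function name `log_diff_to_levels` and the numbers are hypothetical.

```python
import math


def log_diff_to_levels(last_level, log_diff_forecasts):
    """Convert forecasts of log differences into forecasts of levels.

    last_level: last observed value of the original (untransformed) series.
    log_diff_forecasts: forecast log changes for horizons 1, 2, ...
    """
    levels = []
    log_level = math.log(last_level)
    for delta in log_diff_forecasts:
        # Each forecast log change is added to the running log level
        log_level += delta
        levels.append(math.exp(log_level))
    return levels


# Hypothetical example: a rise followed by an equal-sized fall in log terms
path = log_diff_to_levels(100.0, [0.1, -0.1])
```

A positive forecast value thus raises the implied level relative to the previous month and a negative value lowers it, which is the reading applied to the forecasts in Table 11.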
Figure 8 shows a graphical comparison of the differenced gold sales with their corresponding SARIMA forecasted values.
Figure 8 presents the two-year-ahead forecast generated by the SARIMA model using the log-differenced series. The black solid line represents the historical log-differenced gold sales values, while the red dashed line illustrates the SARIMA forecast for the next 24 months. The SARIMA forecast remains centred close to zero throughout the forecast horizon. This suggests that the model expects future changes in gold sales to fluctuate around the historical average growth rate rather than exhibit a strong upward or downward trend. The forecast also shows relatively stable and moderate fluctuations, indicating that SARIMA primarily captures the linear and seasonal components of the series but does not project extreme volatility of the kind observed in some historical periods. Economically, this stability implies relatively balanced market expectations where supply and demand conditions are likely to remain steady in the short- to medium-term.
5. Conclusions and Recommendations
The study investigated the forecasting performance of the SARIMA, GARCH, GRNN and ANN-based ELM models using the South African gold sales series. A visual inspection suggested that the series is non-stationary. The formal test of stationarity confirmed that the gold sales series is non-stationary at level and stationary at first difference. Therefore, the series is integrated of order one, I(1). The findings from the linear process revealed that the selected SARIMA specification is the best model for the gold sales series. In contrast, ref. [23] found that the ANN model outperformed the ARIMA model in forecasting the gold price. The findings from GARCH revealed that past shocks or asymmetries have an important effect on current volatility and there is persistence in the volatility process.
The findings from the nonlinear process revealed that the ANN-based ELM performed better than GRNN and GARCH models. The results revealed that both the SARIMA model and the ANN-based ELM model can deliver accurate forecasts when applied to real-world scenarios. The overall findings revealed that the linear model outperformed the nonlinear models when comparing the forecasting accuracy.
The findings of this study carry broader implications for economic forecasting and analytical methodology. Notably, the superior performance of the traditional SARIMA model over more complex ML techniques supports the view that methodological sophistication does not automatically guarantee forecasting accuracy. The optimal model choice is inherently dependent on the characteristics of the data, which reinforces the continuing value and interpretability of classical time-series approaches. This underscores the importance of robust preprocessing, such as achieving stationarity, as a non-negotiable first step for any modelling exercise. The SARIMA forecasts suggest that the gold sales series will remain volatile in the short- to medium-term, with alternating gains and losses, before showing a modest upward momentum over the long term. Economically, this indicates that gold will continue to serve as a hedge against inflation and market uncertainty, supporting its role in portfolio diversification and risk management. For gold-producing economies and mining firms, short-term price fluctuations may affect revenues and trade balances, while long-term appreciation could enhance profitability and external reserves. Overall, the results highlight the need for active investment and risk strategies in response to gold’s inherent volatility.
For practitioners in the resource sector and beyond, these results suggest that reliable forecasting and risk management, essential for strategic planning, can often be achieved with robust, traditional statistical models. Ultimately, this research provides a cautionary benchmark against the uncritical adoption of advanced algorithms and establishes a comparative framework for evaluating model performance on other volatile economic time series. In support of the findings, ref. [15] also determined that the SARIMA models could be a useful instrument for policymakers and researchers in formulating climate-resilient techniques for this area. By reinforcing the empirical and methodological relevance of SARIMA in seasonal financial time-series forecasting, this study contributes to both the econometric modelling literature and applied financial analytics.
The primary research contribution of this study is its empirical demonstration that, for the specific case of non-stationary South African gold sales data, a well-specified traditional linear model (SARIMA) outperformed both the GARCH model and the nonlinear ML approaches (GRNN and ANN-ELM) in forecasting accuracy over the full sample. This finding challenges the prevailing assumption that complex ML models inherently deliver superior results for economic time-series forecasting. Furthermore, the study provides a rigorous comparative framework for evaluating model performance, highlighting the critical importance of model selection based on data characteristics rather than algorithmic complexity, and offers valuable insights for stakeholders in resource economics by identifying the most reliable forecasting tools for strategic decision-making.
Despite its contributions, this study has several limitations. First, the analysis is restricted to the South African gold sales series, which may limit the generalisability of the findings to other commodities, countries, or macroeconomic indicators. Second, while the study compares traditional models (SARIMA and GARCH) with selected nonlinear ML models (GRNN and ANN-based ELM), it does not include more advanced deep learning architectures such as Long Short-Term Memory (LSTM) or Transformer models. Third, this study did not take into account potential structural breaks and regime shifts in the gold sector. Fourth, the study used nominal gold sales values without adjusting for inflation, despite its potential influence on gold sales values. Lastly, this study relies on monthly data spanning January 2003 to July 2024, and the results may be sensitive to the frequency and length of the time series.
Building upon these findings, future research should explore the development and application of hybrid models that integrate the strengths of linear SARIMA frameworks with the pattern-recognition capabilities of nonlinear ANN-based ELM models, potentially leveraging SARIMA to capture linear components and ANNs to model residuals, thereby enhancing overall forecast accuracy; such hybrid models should be compared against the existing approaches. Subsequent studies could also validate these comparative results across different commodity markets (such as platinum, diamonds, or crude oil) and economic indicators to determine the generalisability of the findings and identify the specific data conditions under which ML models might outperform traditional models. Future studies may benefit from incorporating inflation-adjusted (real) gold sales values or including inflation as an explanatory variable to better understand the economic significance of the results. Furthermore, employing more advanced deep learning architectures, such as LSTM or Transformer networks, could provide a more rigorous benchmark for assessing the potential of complex nonlinear methods in economic forecasting using daily data. Future studies may also consider rolling window evaluation schemes and alternative loss functions to provide more comprehensive benchmarking.
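A rolling window evaluation scheme of the kind suggested above can be sketched as follows. This is an illustration only: the function names are hypothetical, and a seasonal naive rule stands in for the forecasting model, whereas in practice the model of interest (for example, SARIMA or ANN-ELM) would be refitted at each forecast origin.

```python
import math


def rolling_origin_rmse(series, forecaster, min_train, window=None):
    """One-step-ahead RMSE from a rolling-origin evaluation.

    forecaster: function mapping a training list to a one-step forecast.
    window: if given, only the most recent `window` observations are used
    for training (rolling window); otherwise the window expands.
    """
    if not 0 < min_train < len(series):
        raise ValueError("min_train must be between 1 and len(series) - 1")
    squared_errors = []
    for origin in range(min_train, len(series)):
        start = 0 if window is None else max(0, origin - window)
        train = series[start:origin]
        prediction = forecaster(train)
        squared_errors.append((series[origin] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def seasonal_naive(train, period=12):
    """Stand-in forecaster: repeat the value observed one season earlier."""
    return train[-period] if len(train) >= period else train[-1]


# Hypothetical monthly series with an exact 12-month pattern
toy_series = [float(i % 12) for i in range(48)]
rmse_expanding = rolling_origin_rmse(toy_series, seasonal_naive, min_train=24)
rmse_rolling = rolling_origin_rmse(toy_series, seasonal_naive, min_train=24, window=12)
```

Replacing the squared error inside the loop with another loss, such as the absolute error, gives the alternative loss functions mentioned above without changing the evaluation scheme.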