Time-Series Interval Forecasting with Dual-Output Monte Carlo Dropout: A Case Study on Durian Exports

: Deep neural networks (DNNs) are prominent in predictive analytics for accurately forecasting target variables. However, inherent uncertainties necessitate constructing prediction intervals for reliability. The existing literature often lacks practical methodologies for creating predictive intervals, especially for time series with trends and seasonal patterns. This paper explicitly details a practical approach integrating dual-output Monte Carlo Dropout (MCDO) with DNNs to approximate predictive means and variances within a Bayesian framework, enabling forecast interval construction. The dual-output architecture employs a custom loss function, combining mean squared error with Softplus-derived predictive variance, ensuring non-negative variance values. Hyperparameter optimization is performed through a grid search exploring activation functions, dropout rates, epochs, and batch sizes. Empirical distributions of predictive means and variances from the MCDO demonstrate the results of the dual-output MCDO DNNs. The proposed method achieves a significant improvement in forecast accuracy, with an RMSE reduction of about 10% compared to the seasonal autoregressive integrated moving average model. Additionally, the method provides more reliable forecast intervals, as evidenced by a higher coverage proportion and narrower interval widths. A case study on Thailand’s durian export data showcases the method’s utility and applicability to other datasets with trends and/or seasonal components.


Introduction
In the field of predictive analytics, there is a growing need for models that can accurately forecast outcomes while also providing measures of the associated uncertainties.This requirement is particularly critical in time-series forecasting, where decisions often need to be made under uncertain circumstances.Accurate time-series forecasts are essential across various domains, including finance, healthcare, and supply chain management, as they facilitate strategic planning and enhance decision-making processes.However, traditional forecasting methods such as the autoregressive integrated moving average (ARIMA) and seasonal autoregressive integrated moving average (SARIMA), while effective in handling linear patterns and seasonal effects, often fall short in capturing the complex non-linear relationships present in real-world data.Moreover, these methods typically do not provide inherent measures of uncertainty, which are crucial for risk management and informed decision-making.
To address these limitations, dropout techniques have been introduced in neural networks as a practical method to approximate Bayesian inference.Gal and Ghahramani [1] proposed using dropouts in neural networks to measure model uncertainty, offering a theoretical framework for this approach.Building on this, Kendall and Gal [2] expanded the use of Monte Carlo Dropout (MCDO) to differentiate between aleatoric and epistemic uncertainties in deep learning models.Srivastava et al. [3] demonstrated the effectiveness of dropout as a regularization technique in improving neural network performance across tasks such as vision, speech recognition, and document classification.Zhang et al. [4] utilized Bayesian networks for predicting short-term traffic flow, highlighting the importance of uncertainty estimation in practical applications.Pearce et al. [5] merged Bayesian neural networks (BNNs) with deep ensemble methods, showcasing enhanced uncertainty estimation for predictions, particularly in time-series forecasting.Fortunato et al. [6] further demonstrated that a minor modification of the truncated backpropagation through time can significantly enhance regularization and uncertainty estimates with minimal computational cost and an 80% reduction in the number of parameters.
Blundell et al. [7] introduced a variational inference approach to quantify uncertainty in neural network weights, indirectly supporting the methodology behind using dropout for uncertainty estimation in time-series forecasting.This approach aligns with the dualoutput MCDO methodology proposed in our study, where we aim to improve interval forecasting for time-series data.
Recent studies have further explored and improved uncertainty estimation in neural networks.Srisuradetchai et al. [8] introduced the artificial neural network with best subset selection (ANNBS) to improve prediction accuracy by selecting an optimum subset of variables using a Monte Carlo algorithm, significantly enhancing predictive performance by reducing root mean square errors (RMSEs) for both the training and test data.Fan et al. [9] presented a deep learning framework with temporal mechanisms to capture latent patterns important for precise future predictions, generating forecasts for various timeframes and quantiles simultaneously.Lemay et al. [10] evaluated the consistency of artificial intelligence models in clinical processes, highlighting how MCDO predictions improve model repeatability and classification accuracy.Alahmari et al. [11] examined the reliability of deep learning models for segmentation and classification tasks using U-Net and LeNet-5 frameworks, emphasizing the robustness of MCDO in these applications.
Hinton et al. [12] suggested a dropout technique to reduce overfitting in large feedforward neural networks trained on small datasets, improving generalization by preventing feature detectors from co-adapting.Camarasa et al. [13] demonstrated the effectiveness of MCDO for quantifying uncertainty in multi-class segmentation tasks, enhancing the reliability of uncertainty maps in predicting misclassifications.Leibig et al. [14] showed that dropout-based Bayesian uncertainty measures effectively capture decision-making uncertainty in disease detection with deep neural networks, surpassing traditional methods.
García González et al. [15] examined the role of time-series data augmentation and dropout in enhancing deep learning models for fall detection, while Maleki Sadr et al. [16] developed an anomaly detection method for satellite telemetry using an MCDO-based approximation of BNNs.Atencia et al. [17] combined echo state networks (ESNs) with MCDO to quantify uncertainty in time-series prediction without increasing computational expenses.Sheng et al. [18] created a bootstrapping reservoir computing network ensemble (BRCNE) to generate accurate prediction intervals in noisy non-linear time-series forecasting.Khosravi et al. [19] proposed creating prediction intervals for travel time forecasts using delta and Bayesian methods, evaluating the accuracy and reliability of these predictions.
Despite advances, practical methods for constructing prediction intervals for timeseries data with trends and seasonal patterns are still lacking.Point predictions often overlook uncertainty, necessitating new models that incorporate comprehensive uncertainty estimation.This study introduces a prediction interval generation method combining dual-output MCDO with deep neural networks (DNNs).Using a Bayesian framework, it efficiently estimates predictive means and variances.This approach is validated with Thailand's durian export data, which has significant seasonal variations and trends.
The proposed methodology addresses a critical gap in forecasting techniques as follows:

•
Comprehensive uncertainty estimation: provides both aleatoric and epistemic uncertainties, offering a complete uncertainty profile; This paper is structured as follows: Section 2 reviews relevant theories, Section 3 describes the dataset, Section 4 details the methodology, Section 5 presents results, and Section 6 concludes with implications and future research directions.

Aleatoric Uncertainty
Aleatoric uncertainty is captured directly through the predictive variance in a dualoutput configuration.It is also known as data uncertainty and cannot be reduced even if more data are collected.In the context of neural networks, aleatoric uncertainty is captured directly by the network through the predictive variance output by one of the network heads.
Let y i be the actual target value, ŷi be the predicted mean output by the first head of the network, and σ 2 i be the predicted variance output by the second head of the network, representing the aleatoric uncertainty.The network is trained to minimize a loss function that incorporates both the prediction error and the uncertainty estimation.The mean squared error (MSE) modified to include the aleatoric uncertainty can be expressed as follows: where the first term (y i − ŷi ) 2 / 2σ 2 i weights the squared prediction error by the inverse of the predicted variance, emphasizing predictions with lower uncertainty; the second term log σ 2 i /2 acts as a regularizer to prevent the network from predicting infinite variance for reducing the prediction error term [20].

Epistemic Uncertainty
This type of uncertainty arises from the model itself due to limited data or knowledge about the model parameters.It is also known as model uncertainty and can be reduced by gathering more data.Epistemic uncertainty is estimated through the variance in predictions across multiple stochastic forward passes with dropout enabled, simulating a Bayesian posterior distribution.Let T be the number of stochastic forward passes, ŷi,t be the prediction for the i-th instance in the t-th forward pass.The epistemic uncertainty can be quantified as the variance of the following predictions: where ŷi,t /T is the mean prediction across all stochastic forward passes for the i-th instance.Combining both uncertainties, the total predictive uncertainty can be seen as the sum of aleatoric and epistemic uncertainties [21].

Deep Neural Networks
A deep neural network (DNN) is a type of machine learning model that belongs to a broader category of structures designed to represent sophisticated functions.These models are composed of multiple layers of units, or neurons, where each layer performs a transformation of its inputs, typically followed by a non-linear activation function.In mathematical terms, a DNN can be defined as a composition of functions, each representing a layer in the network.
For a DNN with L layers, the function f representing the network is composed of L functions f 1 , f 2 , . . ., f L , each corresponding to a layer.The output of each layer function is the input to the next layer.This can be expressed as follows: where x is the input vector, Θ denotes the set of all parameters in the network, and θ l represents the parameters for the l-th layer.Each layer function f l is typically a linear transformation followed by a non-linear activation: where W l is a weight matrix, b l is a bias vector, and g l is the activation function for the l-th layer [22,23].

Activation Functions
A common choice for the activation function includes non-linear functions like the Rectified Linear Unit (ReLU), Sigmoid, or hyperbolic tangent (Tanh).The choice of activation function and the architecture of the network determine the function class that the DNN can approximate.
Figure 1 illustrates Sigmoid, Tanh, and ReLU.Each function maps the input signal to an output signal and serves as a gate, determining whether and how signals should progress through the network.Details on each activation function are as follows [24]: • The Sigmoid function is defined as σ(x) = 1/(1 + exp(−x)), and it outputs values between 0 and 1.It is a smooth, S-shaped curve that has been widely used historically, especially for binary classification problems; • The Tanh function or hyperbolic tangent function, tanh(x), rescales the sigmoid to output values between −1 and 1.It is zero-centered, making it preferred in certain scenarios as it can help with improving the convergence during the training phase; • The ReLU function, defined as ReLu(x) = max(0,x), activates a neuron only if the input is above zero, providing a piecewise linear output that is computationally efficient and enables the model to leverage sparsity for better performance and faster training.In the dual-output architecture of the neural network, the Softplus activation function is employed for the variance output layer.This choice is strategic; Softplus, defined mathematically as ( ) log(1 exp( )) , is smooth and differentiable, making it wellsuited for predicting positive continuous variables, such as variance, which must be nonnegative.Its application ensures that the network predicts a variance that is not only always positive but also has a gradient that allows for effective backpropagation and learn- In the dual-output architecture of the neural network, the Softplus activation function is employed for the variance output layer.This choice is strategic; Softplus, defined mathematically as f (x) = log(1 + exp(x)), is smooth and differentiable, making it well-suited for predicting positive continuous variables, such as variance, which must be non-negative.Its application ensures that the network predicts a variance that is not only always positive but also has a gradient that allows for effective backpropagation and learning of uncertainty in the data [25].

Regularization
One widely recognized technique for regularization is known as dropout, a process where connections within the network are probabilistically excluded in each iteration of training [3].This method is visually represented in Figure 2, where a standard neural network, (a) before and (b) after the application of dropout, is depicted.In the dual-output architecture of the neural network, the Softplus activation function is employed for the variance output layer.This choice is strategic; Softplus, defined mathematically as ( ) log(1 exp( )) , is smooth and differentiable, making it wellsuited for predicting positive continuous variables, such as variance, which must be nonnegative.Its application ensures that the network predicts a variance that is not only always positive but also has a gradient that allows for effective backpropagation and learning of uncertainty in the data [25].

Regularization
One widely recognized technique for regularization is known as dropout, a process where connections within the network are probabilistically excluded in each iteration of training [3].This method is visually represented in Figure 2, where a standard neural network, (a) before and (b) after the application of dropout, is depicted.In this process, if wij is the weight of the connection from node i in layer l to node j in layer l + 1, during training, wij is adjusted to wijεij, with εij being a binary indicator In this process, if w ij is the weight of the connection from node i in layer l to node j in layer l + 1, during training, w ij is adjusted to w ij ε ij , with ε ij being a binary indicator determined by a Bernoulli distribution with probability 1 − p.Here, p represents the probability of a connection being retained, and accordingly, 1 − p is the probability of it being "dropped".Should ε ij be zero, the outgoing weights from node i are effectively set to zero, simulating the node's temporary removal from the network.
During the training phase, gradients corresponding to the "dropped" weights are not updated, as if those neurons are non-existent.As a result of resampling ε ij in each iteration, the network experiences various configurations of thinned networks, enhancing generalization and robustness.

Dropout as Bayesian Approximation
In Bayesian probability theory, a probabilistic model defines a distribution over possible values of the model parameters, given the data.MCDO serves as a variational Bayesian approximation where dropout is employed to approximate the posterior distribution of a BNN efficiently [26].
Given a neural network with parameters θ, the goal is to compute the posterior p(θ|D) where D represents the data.In MCDO, we use dropout to approximate this posterior as follows: 1.
A variational distribution q(θ) is defined, parameterized by ϕ.In the MCDO context, this equates to integrating dropout within the network; 2.
During each forward pass, weights θ t are sampled from q(θ) by applying dropout, which effectively samples a thinned network; 3.
The output is computed using these sampled weights, which can be denoted as y t = f (x;θ t ), where f is the neural network with dropout applied and x is the input.

Loss Function Incorporating MCDO
The training objective is to minimize the Kullback-Leibler (KL) divergence between the variational distribution q(θ) and the true posterior p(θ|D), which can be transformed into the following optimization problem known as the Evidence Lower Bound (ELBO) [27,28]: For MCDO, the ELBO simplifies because the KL divergence term can be omitted under certain conditions, and the expectation is approximated by averaging over T stochastic forward passes [1,29]:

Predictive Distribution for Interval Forecasting
For interval forecasting, we are interested in the predictive distribution P(y*|x*,D) for a new input x*.Using MCDO, we can approximate this predictive distribution by averaging over the stochastic forward passes: The variance of this predictive distribution across the T passes gives us the epistemic uncertainty, which, when combined with the aleatoric uncertainty directly estimated by the network's output, provides a full predictive distribution for interval forecasting [30][31][32].

Dataset of a Case Study
The dataset underpins Thailand's booming durian industry, which is supported by a favorable agricultural climate and strategic governmental policies, particularly highlighting the significant market demand from China [33,34].With record sales in 2021 and substantial year-on-year growth in 2023, the data, sourced from the Office of Agricultural Economics of Thailand [35], encapsulate monthly export values from January 2015 to November 2023.
The industry's exponential growth is driven by favorable agricultural conditions and robust demand, especially from China [36,37].The dataset, devoid of missing entries, encompasses monthly export values in Thai Baht (THB), showcasing a record-setting performance in May 2021.An analysis of 107 months of data reveals significant seasonal variability, with peak fluctuations in April and May.This variability underscores the forecasting challenges posed by the dynamic durian export market

Proposed Methodology
The proposed methodology integrates several stages, illustrated in Figure 3.The process begins with data extraction, followed by feature engineering to generate relevant features, including time index creation, quadratic trend estimation, and lag features.Subsequent data processing involves data cleaning; splitting into training, validation, and testing sets; and feature scaling.The methodology continues with determining the neural network topology and setting the hyperparameter grid.A custom loss function is defined to combine the mean squared error with variance regularization.Model training is performed, followed by tuning the model using the validation set.The model is then evaluated using the test data.Finally, the methodology concludes with a comparative analysis against benchmark models to assess performance improvements.

Proposed Methodology
The proposed methodology integrates several stages, illustrated in Figure 3.The process begins with data extraction, followed by feature engineering to generate relevant features, including time index creation, quadratic trend estimation, and lag features.Subsequent data processing involves data cleaning; splitting into training, validation, and testing sets; and feature scaling.The methodology continues with determining the neural network topology and setting the hyperparameter grid.A custom loss function is defined to combine the mean squared error with variance regularization.Model training is performed, followed by tuning the model using the validation set.The model is then evaluated using the test data.Finally, the methodology concludes with a comparative analysis against benchmark models to assess performance improvements.Time index creation: To capture the temporal trend, a time index was created along with its polynomial terms, such as the square or cube, to model more complex trends.These indices help in identifying underlying patterns over time.

2.
Quadratic trend estimation: A quadratic regression was fitted to the original data to model the underlying trend and subsequently remove it (detrend), thereby enhancing stationarity.The rationale for choosing a quadratic model is based on the  Time index creation: To capture the temporal trend, a time index was created along with its polynomial terms, such as the square or cube, to model more complex trends.These indices help in identifying underlying patterns over time.

2.
Quadratic trend estimation: A quadratic regression was fitted to the original data to model the underlying trend and subsequently remove it (detrend), thereby enhancing stationarity.The rationale for choosing a quadratic model is based on the observation that the time-series data for Thailand's durian export exhibited a quadratic trend.This trend estimation helps in removing long-term trends, making the time series more stationary and easier to model with neural networks.The trend equation is given as the following: where the corresponding trend line is depicted in Figure 4.While a quadratic model was suitable for this specific dataset, the proposed method is not limited to quadratic trends.Other trend models, such as linear, exponential, or higher-order polynomial regressions can be used if they better fit the data.

3.
Lag feature generation: Lag features up to 12 months prior were created to incorporate historical data points as predictors.If the data exhibit different seasonal patterns or other temporal dependencies, the lag feature span can be modified accordingly.The proposed method is robust and can adapt to different lag feature configurations to capture relevant patterns in the data.

4.
Detrending: To normalize the time-series data y i , where the model is Seasonality S, Cyclical C, and Irregular I), the detrending process involves dividing the original data by the estimated trend: This normalization results in y/T = S × C × I, removing the long-term trend and making the data more stationary.The detrended data are presented in Figure 5.
ratic trend.This trend estimation helps in removing long-term trends, making the time series more stationary and easier to model with neural networks.The trend equation is given as the following: where the corresponding trend line is depicted in Figure 4.While a quadratic model was suitable for this specific dataset, the proposed method is not limited to quadratic trends.Other trend models, such as linear, exponential, or higher-order polynomial regressions can be used if they better fit the data.Lag feature generation: Lag features up to 12 months prior were created to incorporate historical data points as predictors.If the data exhibit different seasonal patterns or other temporal dependencies, the lag feature span can be modified accordingly.The proposed method is robust and can adapt to different lag feature configurations to capture relevant patterns in the data.Trend( ) This normalization results in y T S C I = × × , removing the long-term trend and making the data more stationary.The detrended data are presented in Figure 5.

1.
Data cleaning: rows with missing values, resulting from lagged feature generation, were removed to maintain consistency; 2.
Data splitting: the dataset was split into training and test sets, ensuring a temporal split that mimics real-world forecasting scenarios; 3.
Feature scaling: the features were standardized using StandardScaler() in Python (version 3.11.5)with scikit-learn (version 1.3.0) to normalize the data, improving the neural network's convergence.

Neural Network Topology
Figure 6 shows one example architecture of the dual-output MCDO neural networks used in this study.It includes an input layer that accepts a feature vector of dimension 12 (corresponding to 12 lagged variables), three dense layers with ReLU activation (120, 60, and 30 neurons), each followed by dropout layers (with rates of 0.4, 0.3, and 0.2, respectively), and branches into two output layers: one for the mean prediction and one for the variance prediction with Softplus activation.This structure can be adjusted to fit different datasets and requirements. (1) (1) ( ) where   A minimal multi-layer perceptron (MLP) with a single hidden layer of several neurons (e.g., 10-20) was initially used to establish a baseline.Based on the baseline performance, the network depth was gradually increased by adding more layers and neurons to capture complex patterns.For instance, starting with a two-layer model (60-30 neurons) and then increasing the network depth by adding more neurons (e.g., 120 neurons for three layers, 180 for four, etc.), while carefully monitoring for overfitting.This step-by-step increase ensures that the added complexity is justified by improved performance; 2.

Hyperparameter Grid
Different activation functions such as ReLU, Sigmoid, and Tanh were evaluated for their unique characteristics in modeling; The first dense layer has 120 neurons with ReLU activation, producing a 120-dimensional vector h (1) i for each instance, calculated as follows: where W (1) ∈ R 12×120 and b (1) ∈ R 120 are the weights and biases of the dense layer, respectively, and σ denotes the chosen activation function.The subsequent dropout layer is included to mitigate overfitting by randomly setting 40% (dropout rate of 0.4) of the input units to zero at each update during the training time.

1.
A minimal multi-layer perceptron (MLP) with a single hidden layer of several neurons (e.g., 10-20) was initially used to establish a baseline.Based on the baseline performance, the network depth was gradually increased by adding more layers and neurons to capture complex patterns.For instance, starting with a two-layer model (60-30 neurons) and then increasing the network depth by adding more neurons (e.g., 120 neurons for three layers, 180 for four, etc.), while carefully monitoring for overfitting.This step-by-step increase ensures that the added complexity is justified by improved performance; 2.
Different activation functions such as ReLU, Sigmoid, and Tanh were evaluated for their unique characteristics in modeling; 3.
Dropout, as a method to combat overfitting, involves deactivating a random subset of neurons during each training cycle.Dropout rates were varied across different layers to explore their combined effects.For instance, with a configuration of [0.4,0.3, 0.2], the first hidden layer uses a dropout rate of 0.4, the second hidden layer uses 0.3, and the third hidden layer uses 0.2.This gradation from 0.2 up to 0.7 allows for a progression from minimal to more intensive regularization, adapting the model's complexity and preventing it from memorizing the training data too closely.In some experiments, up to five dropout rates were applied at different points in the network to explore their combined effects on regularization; 4.
Training sessions for the models were conducted over

Custom Loss Function
Given the predictions y pred , which consist of the mean predictions (µ i ) and variance predictions (v i ), the custom loss function for a single instance is a combination of the mean squared error (MSE) for the mean predictions and a regularization term for the variance predictions.Let y true be the vector of true values and y pred = [µ i , v i be the vector of predictions for the i-th instance.The MSE for the mean predictions is defined as follows: where y true,i refers to the true value for the i-th instance, and µ i refers to the predicted mean for the i-th instance.The Softplus function is applied to the variance predictions to ensure they are positive: The variance regularization term is the mean of the transformed variance predictions: The custom loss for the batch is the sum of the MSE and the variance regularization term: L(y true , where λ is a weighting term to balance the two components of the loss function.This balance ensures that the predicted variance reflects the true uncertainty in the predictions by preventing v + i from collapsing to zero, thus maintaining meaningful variance estimates.The dimensions of y true and y pred are not directly comparable, as y pred contains more information (both mean and variance) than y true .Validation set evaluation: The validation set (16% of the total data) was used to evaluate the performance of each model configuration.The root mean squared error (RMSE) on the validation set was used as the primary metric to compare different hyperparameter settings.The model configuration with the lowest validation RMSE was selected as the optimal model; 3.

Training Process
MCDO for uncertainty estimation: post-training, MCDO was employed to generate predictive distributions by performing multiple forward passes with dropout enabled, aggregating the results to estimate the mean and variance of predictions.

Proposed Method
The proposed method was tuned using a thorough hyperparameter grid search to identify the best combination of hyperparameters.The model configuration with the lowest validation RMSE was selected as the optimal model.The final evaluation metrics, including RMSE, were computed based on this optimized model.

Benchmark Method
The SARIMA model was also tuned to achieve the best performance.The parameters for the SARIMA model were selected using a grid search process similar to the proposed method.The best-performing SARIMA model, based on validation performance, was used for the final comparison.

Applying Dual-Output MCDO for Interval Forecasting
A dual-output neural network architecture combined with the MCDO methodology is used to forecast durian export values and provide reliable uncertainty estimation.The network employs lagged variables covering a period of 1 to 12 months to account for temporal relationships and seasonal patterns in the data.The results are as follows: 1.
Mean output: This layer predicts the expected mean value of durian exports for a future time point.The prediction is formulated as ŷMean = f (x; Θ Mean ), where x represents the input features, including lagged variables, and Θ Mean embodies the network's learned parameters.Dropout introduces a Bayesian approximation; 2.
Variance output: This layer estimates the predictive variance, which quantifies the aleatoric uncertainty inherent in the data.The estimated variance is ŷVar = f (x; Θ Var ).
The variance is directly used to create intervals around the predicted mean, using the standard formula for a 95% confidence interval [38]: Because the MCDO approach can quantify uncertainty by performing multiple stochastic forward passes with dropout, simulating a Bayesian posterior distribution of weights, the predictive distributions for both the means and variances will be graphically presented in the Section 5.

Evaluation
A comparative analysis will be conducted on the forecast intervals compared to the seasonal autoregressive integrated moving average (SARIMA) model benchmarks.This comparison will use metrics like the mean interval width, coverage proportion, and the incidence rate of non-positive lower limits.When assessing point forecasts, the evaluation will include the root mean squared error (RMSE) and the determination coefficient, represented as R 2 .The detailed descriptions of these measures are as follows:

•
Coverage proportion (%Coverage): This measure indicates the proportion of times the actual values fall within the predicted intervals [39].It is defined by the following: • Incidence of non-positive lower limits (%NegLB): This metric calculates how frequently the predicted intervals have a lower limit that is non-positive, which is crucial for datasets where such values are infeasible.It is defined by the following: • Average width of forecast interval (WidthAvg): Represents the mean distance between the upper and lower bounds of forecast intervals [40].It is defined as follows: • Root mean squared error (RMSE): This is a measure of the average discrepancy between the predicted and actual values.It is defined by the following: • Coefficient of determination (R 2 ): Indicates the fraction of the variance in the observed values that is predictable from the independent variables.It is defined by the following: • The seasonal autoregressive integrated moving average (SARIMA) model is an extension of the ARIMA model that specifically addresses and models seasonal components of a time series [41,42].Choosing SARIMA as a benchmark for comparison with proposed forecast intervals for durian export is strategic due to its comprehensive ability to model both seasonal and non-seasonal patterns in time-series data.Elements of SARIMA include non-seasonal (p, d, q) and seasonal (P, D, Q, S) terms, where p and P represent the order of the autoregressive terms, d and D signify the degree of differencing, q and Q denote the order of the moving average terms, and S corresponds to the length of the seasonal cycle.The SARIMA model is defined by the following: The autoregressive terms ϕ and Φ account for the non-seasonal and seasonal dynamics, respectively.Conversely, the moving average terms θ and Θ address the non-seasonal and seasonal parts.The lag operator L systematically shifts time-series data [43][44][45].

Results
The Section 5 is organized into four distinct subsections: the optimization of the dual-output MCDO neural network and the influence of hyperparameters; the predictive distributions of point forecasts; the distributions of forecasted variances; and the comparative analysis of forecast intervals derived from both the proposed method and the SARIMA model.

Optimal Model and Effects of the Parameters
The optimal model, which achieves the lowest custom loss function as defined in (14), consists of three hidden layers-[p = 30, p = 60, p = 120]-uses the Tanh activation function, and has dropout rates of [p = 0.4, p = 0.3, p = 0.2].It was trained for 25 epochs with a batch size of 16, resulting in an RMSE of 11,689.41.Under investigation, seven of the ten optimal models have a number of hidden layers not greater than three.This suggests that the deeper dual-output MCDO neural networks experience the overfitting issue.The RMSE of the best 40 models and corresponding hyperparameters are presented in Figure 7.
The top 10 optimal MCDO NNs typically feature smaller batch sizes, with the majority employing a batch size of only 16-contrast this with the default value of 32 in the Keras module.The number of epochs commonly ranges from 25 to 50, indicating that a larger count of epochs, such as 100, might not be necessary.It is important to note that the Keras module does not have a default value for epochs; this parameter must be explicitly specified by the user.
consists of three hidden layers-[p = 30, p = 60, p = 120]-uses the Tanh activation function, and has dropout rates of [p = 0.4, p = 0.3, p = 0.2].It was trained for 25 epochs with a batch size of 16, resulting in an RMSE of 11,689.41.Under investigation, seven of the ten optimal models have a number of hidden layers not greater than three.This suggests that the deeper dual-output MCDO neural networks experience the overfitting issue.The RMSE of the best 40 models and corresponding hyperparameters are presented in Figure 7.The top 10 optimal MCDO NNs typically feature smaller batch sizes, with the majority employing a batch size of only 16-contrast this with the default value of 32 in the Keras module.The number of epochs commonly ranges from 25 to 50, indicating that a larger count of epochs, such as 100, might not be necessary.It is important to note that the Keras module does not have a default value for epochs; this parameter must be explicitly specified by the user.Figure 8 shows that models with three layers have the lowest average RMSE, indicating the best predictive performance among the models evaluated.As the number of layers increases or decreases from three, the average RMSE tends to increase, suggesting a decrease in predictive accuracy.Models with four layers exhibit the highest average RMSE, which indicates that additional complexity in this context does not correlate with better performance.Conversely, models with two, five, and six layers show progressively higher RMSEs than the three-layer models, indicating that a medium level of complexity provides a more optimal balance between underfitting and overfitting for the given dataset.
Forecasting 2024, 6, FOR PEER REVIEW 14 Figure 8 shows that models with three layers have the lowest average RMSE, indicating the best predictive performance among the models evaluated.As the number of layers increases or decreases from three, the average RMSE tends to increase, suggesting a decrease in predictive accuracy.Models with four layers exhibit the highest average RMSE, which indicates that additional complexity in this context does not correlate with better performance.Conversely, models with two, five, and six layers show progressively higher RMSEs than the three-layer models, indicating that a medium level of complexity provides a more optimal balance between underfitting and overfitting for the given dataset.Figure 9 presents comparisons of the average RMSE of neural network models utilizing different activation functions: ReLU, Sigmoid, and Tanh.Each bar corresponds to the average RMSE for models employing one of these activation functions.The ReLU has the highest average RMSE among the three.The Sigmoid's bar is noticeably lower, suggesting better average performance than ReLU.However, it is the Tanh activation function that exhibits the lowest average RMSE of the three, indicating that models using Tanh achieved the best average predictive accuracy in this evaluation.Figure 9 presents comparisons of the average RMSE of neural network models utilizing different activation functions: ReLU, Sigmoid, and Tanh.Each bar corresponds to the average RMSE for models employing one of these activation functions.The ReLU has the highest average RMSE among the three.The Sigmoid's bar is noticeably lower, suggesting better average performance than ReLU.However, it is the Tanh activation function that exhibits the lowest average RMSE of the three, indicating that models using Tanh achieved the best average predictive accuracy in this evaluation.From Figure 10, it is observed that the models with Tanh activation tend to have lower RMSEs across various dropout rates, especially in configurations with three layers, indicating better performance.Models with ReLU activation show a wider spread in RMSE values across dropout rates, suggesting that the performance of ReLU models may be more sensitive to the chosen dropout rate.The Sigmoid activation appears to perform generally in between the ReLU and Tanh models in terms of RMSEs.From Figure 10, it is observed that the models with Tanh activation tend to have lower RMSEs across various dropout rates, especially in configurations with three layers, indicating better performance.Models with ReLU activation show a wider spread in RMSE values across dropout rates, suggesting that the performance of ReLU models may be more sensitive to the chosen dropout rate.The Sigmoid activation appears to perform generally in between the ReLU and Tanh models in terms of RMSEs.
Figure 11 presents a scatter plot illustrating the RMSEs across three distinct epoch values (25,50,100) for NNs employing different activation functions.At 25 and 100 epochs, the spread of RMSE values is relatively similar for models using ReLU and Tanh activation functions, with a few models using the Tanh function achieving slightly lower RMSE values.For models trained with 50 epochs, the Tanh activation function seems to result in a slightly lower RMSE when compared to ReLU and Sigmoid, suggesting that Tanh might be more effective, or the models may have converged better at this training length.The variability in RMSE seems to be less for the ReLU activation function, which might indicate a more stable performance or better optimization at this number of epochs compared to the others.RMSE values.For models trained with 50 epochs, the Tanh activation function seems to result in a slightly lower RMSE when compared to ReLU and Sigmoid, suggesting that Tanh might be more effective, or the models may have converged better at this training length.The variability in RMSE seems to be less for the ReLU activation function, which might indicate a more stable performance or better optimization at this number of epochs compared to the others.

Distributions of Forecasting Means
The predictive mean distributions from the optimal dual-output MCDO neural network, as depicted in Figure 12, showcase symmetrical shapes centered around their respective means, indicating a dependable central tendency that is vital for interval forecasting; however, the shapes are not the same.Notably, the distribution for month 1 in the test data (December 2022) is characterized by a narrow, pronounced peak, denoting highly precise predictions, while months 5 and 6 in the test data (April-May 2023) display a broader distribution, suggesting increased predictive uncertainty.month 9 (August 2023) exhibits a slight right skew, which points to the potential presence of outliers.Considering

Distributions of Forecasting Means
The predictive mean distributions from the optimal dual-output MCDO neural network, as depicted in Figure 12, showcase symmetrical shapes centered around their respective means, indicating a dependable central tendency that is vital for interval forecasting; however, the shapes are not the same.Notably, the distribution for month 1 in the test data (December 2022) is characterized by a narrow, pronounced peak, denoting highly precise predictions, while months 5 and 6 in the test data (April-May 2023) display a broader distribution, suggesting increased predictive uncertainty.month 9 (August 2023) exhibits a slight right skew, which points to the potential presence of outliers.Considering the extensive number of iterations (20,000) and the largely symmetrical shapes of the distributions, the application of normal approximations for the construction of forecast intervals is statistically substantiated.

Distributions of Forecasting Variances
The predictive variance histograms from the optimal dual-output MCDO neural network, as shown in Figure 13, for test months 1 to 12, exhibit a declining trend with variances primarily concentrated toward the lower end.These positively skewed distributions indicate a lower likelihood of very large variances compared to smaller ones.The more concentrated and skewed toward zero these distributions are, the narrower the forecast intervals will be, suggesting higher forecast precision.However, the presence of longer tails, particularly in months 9 to 12 (August-November 2023), points to potential outliers and predicts with less certainty, resulting in wider intervals.

Forecast Intervals of Durian Export
The comparison of the two forecasting methods, spanning from December 2022 to November 2023, is presented in Figure 14.For point forecasts, both models exhibited similar accuracy metrics.The dual-output MCDO NN reported an RMSE of 11,689.40,an MAE of 7974.43, and an R 2 coefficient of 0.2297.In contrast, the SARIMA model displayed a comparable RMSE of 11,694.33, a slightly lower MAE of 7772.35, and an R 2 value of 0.2290.
When evaluating the quality of forecast intervals-which is crucial since negative export values are not feasible-the MCDO NN demonstrated superior performance.It achieved a higher coverage proportion of 75%, as opposed to the SARIMA model's 66.67%.Moreover, the MCDO NN's intervals were on average narrower, with an average width of 13,373.84,suggesting more precise predictions compared to the SARIMA model's broader intervals, which averaged 14,514.05.The MCDO NN also exhibited a lower incidence of non-positive lower limits at 41.67% compared to 66.67% for the SARIMA model.
the extensive number of iterations (20,000) and the largely symmetrical shapes of the distributions, the application of normal approximations for the construction of forecast intervals is statistically substantiated.ances primarily concentrated toward the lower end.These positively skewed distributions indicate a lower likelihood of very large variances compared to smaller ones.The more concentrated and skewed toward zero these distributions are, the narrower the forecast intervals will be, suggesting higher forecast precision.However, the presence of longer tails, particularly in months 9 to 12 (August-November 2023), points to potential outliers and predicts with less certainty, resulting in wider intervals.

Forecast Intervals of Durian Export
The comparison of the two forecasting methods, spanning from December 2022 to November 2023, is presented in Figure 14   When evaluating the quality of forecast intervals-which is crucial since negative export values are not feasible-the MCDO NN demonstrated superior performance.It achieved a higher coverage proportion of 75%, as opposed to the SARIMA model's 66.67%.Moreover, the MCDO NN's intervals were on average narrower, with an average width of 13,373.84,suggesting more precise predictions compared to the SARIMA model's broader intervals, which averaged 14,514.05.The MCDO NN also exhibited a lower incidence of non-positive lower limits at 41.67% compared to 66.67% for the SARIMA model.
It is noticeable that in test months 5 and 6 (April-May 2023), the point forecasts are not near the true values.This can be explained by Figure 12, where the corresponding predictive mean distributions have heavy tails and more variation than in other test months, indicating high uncertainty for these point forecasts.However, Figure 13 shows that the distribution of forecasted variances for April 2023 (month 5) has a short tail with its mode very near to zero, suggesting that the resulting interval width is expected to be narrower compared to other months.
In contrast, Figure 12 illustrates that in certain months with shorter tails in the distributions, we can anticipate a certain level of accuracy in point forecasting.For instance, in It is noticeable that in test months 5 and 6 (April-May 2023), the point forecasts are not near the true values.This can be explained by Figure 12, where the corresponding predictive mean distributions have heavy tails and more variation than in other test months, indicating high uncertainty for these point forecasts.However, Figure 13 shows that the distribution of forecasted variances for April 2023 (month 5) has a short tail with its mode very near to zero, suggesting that the resulting interval width is expected to be narrower compared to other months.
In contrast, Figure 12 illustrates that in certain months with shorter tails in the distributions, we can anticipate a certain level of accuracy in point forecasting.For instance, in month 1 (December 2022), the distribution appears symmetric and not widely spread.However, the corresponding forecasting variance for month 1, as shown in Figure 13, does not have a mode close to zero, indicating that the interval width might not be as narrow as expected.

Conclusions and Discussion
This paper presents an explicit methodology based on a dual-output MCDO neural network for constructing forecast intervals in time-series data, using Thailand's durian export data as an illustrative example.The methodology effectively captures trends and seasonal patterns while providing comprehensive uncertainty estimates through a Bayesian framework.
The dual-output MCDO model improves forecast accuracy, achieving a lower RMSE compared to traditional models like SARIMA.It handles complex, non-linear relationships and prevents overfitting through dropout techniques.An optimal configuration with three hidden layers, Tanh activation, 25 epochs, and dropout rates around 0.30 was identified through a grid search.The model's flexibility allows adaptation for different datasets, demonstrating its utility in generating reliable forecast intervals.
However, validating this study on a single dataset may limit the generalizability of the results, particularly concerning the impact of hyperparameters.Additionally, the computational resources required could be substantial, which might constrain its application in some settings.
Future research can apply this methodology to other types of time-series data, refine the hyperparameter optimization process, and integrate additional external variables for enhanced performance.Potential developments include handling multivariate time series, incorporating real-time data updates, and improving computational efficiency to support larger datasets and more complex models.Exploring ensemble methods combining multiple MCDO models could further enhance robustness and accuracy.

Figure 2 .
Figure 2. Neural network (a) before and (b) after applying dropout.Arrows represent connections; missing arrows in (b) indicate dropout.

Figure 2 .
Figure 2. Neural network (a) before and (b) after applying dropout.Arrows represent connections; missing arrows in (b) indicate dropout.

Figure 3 .
Figure 3. Framework of the methodology for interval forecasting.

Figure 3 .
Figure 3. Framework of the methodology for interval forecasting.

Figure 4 .
Figure 4. Durian export data with a quadratic trend line.

4 .(
Detrending: To normalize the time-series data i y , where the model is i Trend T , Seasonality S , Cyclical C , and Irregular I ), the detrending pro- cess involves dividing the original data by the estimated trend: normalization results in y T S C I = × × , removing the long-term trend and making the data more stationary.The detrended data are presented in Figure 5.

Figure 4 .
Figure 4. Durian export data with a quadratic trend line.

Figure 4 .
Figure 4. Durian export data with a quadratic trend line.

(
normalize the time-series data i y , where the model is i Trend T , Seasonality S , Cyclical C , and Irregular I ), the detrending pro- cess involves dividing the original data by the estimated trend: ,detrended
and biases of the dense layer, respectively, and σ denotes the chosen activation function.The subsequent dropout layer is included to mitigate overfitting by randomly setting 40% (dropout rate of 0.4) of the input units to zero at each update during the training time.

Figure 6 .
Figure 6.Example of the neural network architecture of a dual-output MCDO with dropout.

Figure 6 .
Figure 6.Example of the neural network architecture of a dual-output MCDO with dropout.

1 .
Model compilation and training: Each model configuration from the hyperparameter grid was compiled and trained on the scaled and split training data.The dataset was divided into training (80%) and testing (20%) subsets to ensure proper training and evaluation.Within the training set, a further split was made to create a validation set (20% of the training set), resulting in 64% of the data used for training, 16% for validation, and 20% for testing; 2.

Figure 7 .
Figure 7. Top 40 dual-output MCDO neural networks with the lowest RMSE values.

Figure 7 .
Figure 7. Top 40 dual-output MCDO neural networks with the lowest RMSE values.

Figure 8 .
Figure8.The number of layers against the average RMSE for models with that specific layer configuration.

Figure 9 Figure 8 .
Figure 9 presents comparisons of the average RMSE of neural network models utilizing different activation functions: ReLU, Sigmoid, and Tanh.Each bar corresponds to the average RMSE for models employing one of these activation functions.The ReLU has the highest average RMSE among the three.The Sigmoid's bar is noticeably lower, sug-Figure 8.The number of layers against the average RMSE for models with that specific layer configuration.

Figure 8 .
Figure8.The number of layers against the average RMSE for models with that specific layer configuration.

Forecasting 2024, 6 ,Figure 10 .
Figure 10.Scatter plot between the average dropout rate (with added jitter) and the RMSEs of all candidate NNs.

Figure 11
Figure 11 presents a scatter plot illustrating the RMSEs across three distinct epoch values (25, 50, 100) for NNs employing different activation functions.At 25 and 100 epochs, the spread of RMSE values is relatively similar for models using ReLU and Tanh activation functions, with a few models using the Tanh function achieving slightly lower RMSE values.For models trained with 50 epochs, the Tanh activation function seems to

Figure 10 .
Figure 10.Scatter plot between the average dropout rate (with added jitter) and the RMSEs of all candidate NNs.

Figure 11 .
Figure 11.Scatter plot between the numbers of epochs (with added jitter) and the RMSEs of all candidate NNs.

Figure 11 .
Figure 11.Scatter plot between the numbers of epochs (with added jitter) and the RMSEs of all candidate NNs.

Figure 12 .
Figure 12.The predictive mean distributions from the optimal dual-output MCDO neural network for test months 1-12.

Figure 12 .
Figure 12.The predictive mean distributions from the optimal dual-output MCDO neural network for test months 1-12.

Figure 13 .
Figure 13.The distributions of forecasting variances for test months 1-12.
. For point forecasts, both models exhibited similar accuracy metrics.The dual-output MCDO NN reported an RMSE of 11,689.40,an MAE of 7974.43, and an R 2 coefficient of 0.2297.In contrast, the SARIMA model displayed a comparable RMSE of 11,694.33, a slightly lower MAE of 7772.35, and an R 2 value of

Figure 13 .
Figure 13.The distributions of forecasting variances for test months 1-12.

Figure 14 .
Figure 14.Forecast intervals obtained from the optimal dual-output MCDO neural network and SARIMA model.

Figure 14 .
Figure 14.Forecast intervals obtained from the optimal dual-output MCDO neural network and SARIMA model.
stability of the model's convergence but may prolong the duration of training.Conversely, larger batches expedite the training phase but could compromise convergence stability and the model's ability to generalize effectively across unseen data.
too many epochs can lead to overfitting, where the model too closely adapts to the training data.Conversely, too few epochs may not provide enough learning opportunity, potentially causing underfitting; 5.Batch sizes were varied across 16, 32, 64, or 128 to adjust the quantity of data samples processed simultaneously by the network.Opting for smaller batch sizes tends to enhance the