3.1. Demand Temporal Structure (ACF/PACF and STL)
Figure 3 shows the autocorrelation function (ACF) and partial autocorrelation function (PACF) of daily demand for Stores 1–5. In all cases, the ACF decays gradually and shows significant positive correlations (at the 95% confidence level) across multiple lags, indicating temporal dependence in demand. Recurrent peaks are also observed at lags close to 7 days, suggesting a weekly pattern in transaction behavior.
The PACF shows a dominant spike at lag 1 for all stores, indicating short-term dependence. In addition, significant values appear at lags that are multiples of 7 days, with magnitudes that vary across stores. For higher lags, partial autocorrelations tend to fall within the confidence bands, suggesting a limited direct contribution from higher-order terms. These results support the inclusion of short-term lags and calendar features in the forecasting model and are consistent with the use of nonlinear methods such as RF to represent both temporal dependence and seasonal patterns in daily demand.
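The ACF reading described above can be illustrated with a minimal sketch. It assumes the usual sample autocorrelation estimator and the approximate 95% band of ±1.96/√n; the weekly profile and the series are synthetic placeholders, not the stores' data, and the PACF is not covered here.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def significant_lags(series, max_lag):
    """Lags whose |ACF| exceeds the approximate 95% band 1.96/sqrt(n)."""
    band = 1.96 / math.sqrt(len(series))
    r = acf(series, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(r[k]) > band], band

# Synthetic daily demand with a weekly cycle and a mild trend (illustrative only).
weekly_profile = [100, 95, 90, 105, 130, 160, 150]
demand = [weekly_profile[t % 7] + 0.05 * t for t in range(140)]

r = acf(demand, 14)
lags, band = significant_lags(demand, 14)
print(f"ACF at lag 7: {r[7]:.2f}, 95% band: +/-{band:.3f}")
print("Significant lags:", lags)
```

On a series with a weekly cycle, the ACF peaks at lags 7 and 14 stand out clearly against the band, which is the pattern the text reads from Figure 3.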
Figure 4 shows the STL decomposition of daily demand (Tx) for Store 1, separating the series into trend, seasonal component, and remainder. The original series exhibits high day-to-day variability, while the trend captures medium- and long-term evolution with gradual changes over the analyzed period. The seasonal component shows a stable and recurrent periodic pattern associated with the weekly demand cycle, and the remainder is centered around zero with high-frequency fluctuations and isolated events not explained by the trend or seasonality.
For Stores 2–5 the STL decomposition exhibits a similar structure with a smoothed trend, regular seasonality, and residuals centered around zero. Differences across stores are observed in demand level, seasonal amplitude, and residual magnitude, reflecting operational heterogeneity and local variability. These results support the inclusion of calendar features and temporal lags in forecasting and motivate store-specific modeling within the simulation and optimization framework.
3.3. Random Forest Forecasting Results
Table 4 summarizes the final RF hyperparameter configuration selected for each store, indicating that tuning was not uniform across units of analysis. In general, the model converged to relatively shallow trees for four of the five stores, whereas one store required higher complexity to capture additional patterns in the series. The max_features parameter alternated between 0.8 and 1.0, suggesting that performance improved either by using all predictors or by using a fraction to increase tree diversity and reduce inter-tree correlation, depending on the store. Regarding regularization, the split threshold reported in Table 4, together with min_samples_leaf values in the range 5–15, indicates early splitting with overfitting control through a minimum number of observations per leaf. Finally, the number of trees ranged from 100 to 300, reflecting store-level differences in the stability required; larger values are associated with variance reduction and improved forecast robustness.
Table 5 reports forecasting accuracy metrics for each store on the test set. In terms of error magnitude, RMSE ranges from 102.3 to 150.4 transactions/day, while MAE ranges from 72.1 to 102.5 transactions/day. These ranges indicate stable performance across stores, with differences attributable to store-specific demand variability. Stores 4 and 5 show the lowest RMSE (107.0 and 102.3) and MAE (77.8 and 72.1), indicating closer agreement between forecasts and observed demand during the evaluation period.
For relative metrics, MAPE remains below 13% for all stores, with values between 7.6% and 12.5%, indicating bounded percentage errors over the test set. Robust MAPE ranges from 5.5% to 7.9% and is consistently lower than standard MAPE across all stores. This gap indicates reduced sensitivity of the error estimate to extreme observations, consistent with forecasting performance that is less affected by atypical demand episodes. In addition, the same test sample size (N = 152) for all five stores supports direct comparability of the metrics under consistent evaluation conditions, enabling cross-store performance comparisons despite differences in demand dynamics.
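The gap between standard and robust MAPE can be illustrated with a short sketch. The definition of "Robust MAPE" is not given in this section, so the median absolute percentage error is used here purely as an assumption; the demand and forecast values are invented for illustration.

```python
import math
import statistics

def forecast_metrics(actual, predicted):
    """RMSE, MAE, MAPE, and a median-based robust MAPE (assumed definition)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    # Absolute percentage errors; days with zero demand are skipped.
    ape = [abs(e) / abs(a) * 100 for e, a in zip(errors, actual) if a != 0]
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
        "mae": sum(abs(e) for e in errors) / len(errors),
        "mape": sum(ape) / len(ape),
        "robust_mape": statistics.median(ape),  # assumption: median APE
    }

# Illustrative values; the last day is an atypical demand drop.
actual = [1000, 1100, 900, 1050, 400]
predicted = [950, 1150, 920, 1000, 800]

m = forecast_metrics(actual, predicted)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

A single atypical day dominates the mean-based MAPE while leaving the median-based figure almost unchanged, which is one way a robust variant can sit consistently below the standard one.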
Table 6 compiles the 10 most important variables across the five stores, reporting the importance of each predictor (%) obtained from the RF models. The feature sin_day_of_year encodes annual seasonality in a cyclic way and avoids an artificial discontinuity at the year boundary. Temperature was included as an exogenous predictor to capture weather-related demand variability not explained by lagged demand features. The global rank is defined by the mean importance across stores, while the Store 1–Store 5 columns show store-specific contributions.
Overall, lag-based temporal variables dominate: transactions_lag_7 and transactions_lag_14 account for the largest average importance, indicating that weekly autocorrelation explains a substantial share of demand dynamics. In contrast, calendar variables (e.g., day_of_week and sin_day_of_year) and the climatic variable (temperature) show comparatively lower contributions, suggesting that demand is driven primarily by its own recent history rather than by weather conditions. The Std column summarizes cross-store variability, indicating heterogeneity in the relative influence of some predictors across locations.
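The two feature families discussed above can be sketched as follows. The feature names follow the text; the period of 365.25 days and the helper functions are assumptions for illustration, not the authors' preprocessing code.

```python
import math
from datetime import date

def sin_day_of_year(d, period=365.25):
    """Cyclic encoding of the day of year (period is an assumed value)."""
    return math.sin(2 * math.pi * d.timetuple().tm_yday / period)

def lag_features(series, lags=(7, 14)):
    """One row of lagged values for every day that has all lags available."""
    start = max(lags)
    return [{f"transactions_lag_{k}": series[t - k] for k in lags}
            for t in range(start, len(series))]

# The raw day-of-year index jumps from 365 to 1 at the year boundary,
# while the sine encoding changes only slightly.
dec31 = sin_day_of_year(date(2023, 12, 31))
jan01 = sin_day_of_year(date(2024, 1, 1))
print(f"Dec 31: {dec31:.4f}, Jan 1: {jan01:.4f}")

rows = lag_features(list(range(20)))
print(rows[0])
```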
3.4. GA Optimization and Policy Behavior
Figure 6 shows the evolution of GA fitness (ROI) for the five stores, including the Best, Mean, and Worst curves and the dispersion band (Mean ± Std). All cases exhibit rapid convergence. During the first generations (approximately 1–10), the Mean increases sharply and approaches the Best value. After this phase, the curves enter a plateau where further improvements are marginal, indicating that high-quality solutions are reached early and subsequent iterations provide incremental refinements.
The Mean ± Std band narrows as generations progress, indicating reduced population variability and increased stability of the solutions. The Worst trajectory shows higher variability and lower values, particularly at the beginning, which is expected during exploration when suboptimal individuals coexist with competitive ones. As the search proceeds, Worst increases, indicating that improvement is not limited to the best individual but extends to the overall population.
Across stores, convergence dynamics are similar: (i) Best reaches a high value rapidly and remains nearly constant, (ii) Mean converges close to Best, and (iii) Std decreases and remains bounded. These results indicate that the GA configuration is stable and that a moderate number of generations is sufficient to achieve practical convergence; additional generations would provide limited gains relative to the computational cost.
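The convergence statistics read from Figure 6 (Best, Mean, Worst, Std per generation) can be reproduced on a toy problem. The sketch below is a generic GA with elitism, tournament selection, uniform crossover, and Gaussian mutation; these operators, the population settings, and the quadratic toy fitness are all assumptions, since the paper's GA configuration and ROI simulation are not described in this section.

```python
import random
import statistics

def run_ga(fitness, n_genes, bounds, pop_size=30, generations=40,
           mutation_rate=0.2, sigma=0.1, seed=0):
    """Generic GA sketch that records per-generation fitness statistics."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for gen in range(generations):
        scores = [fitness(ind) for ind in pop]
        history.append({
            "best": max(scores),
            "mean": statistics.mean(scores),
            "worst": min(scores),
            "std": statistics.pstdev(scores),
        })
        # Elitism: the best individual survives unchanged.
        elite = pop[max(range(pop_size), key=scores.__getitem__)]
        next_pop = [elite[:]]
        while len(next_pop) < pop_size:
            parents = []
            for _ in range(2):
                # Tournament selection among three random individuals.
                contenders = rng.sample(range(pop_size), 3)
                parents.append(pop[max(contenders, key=scores.__getitem__)])
            # Uniform crossover followed by bounded Gaussian mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(*parents)]
            child = [min(hi, max(lo, g + rng.gauss(0, sigma)))
                     if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return history

def toy_fitness(genes):
    """Placeholder fitness with a known optimum; stands in for simulated ROI."""
    target = [1.0, 0.5, 1.5]
    return -sum((g - t) ** 2 for g, t in zip(genes, target))

history = run_ga(toy_fitness, n_genes=3, bounds=(0.0, 2.0))
for gen in (0, 5, 10, 39):
    h = history[gen]
    print(f"gen {gen:2d}: best={h['best']:.4f} mean={h['mean']:.4f} std={h['std']:.4f}")
```

With elitism, Best can never decrease between generations, and selection pressure pulls Mean toward Best while Std shrinks, which mirrors the qualitative behavior described for the five stores.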
Table 7 reports the optimal policy parameters identified by the GA for each store and the GA ROI achieved under the gene encoding defined in Table 3. ROI values range from 40.2% to 42.4%, with the highest return in Store 1 (42.4%), followed by Store 5 (42.2%) and Store 4 (42.1%). Store 3 achieves 41.3%, and Store 2 records 40.2%. The maximum difference across stores is approximately 2.2 percentage points, indicating stable performance of the optimized policy across different operating conditions.
The Base Demand Factor remains close to 1.0 for all stores, with values below 1 in Store 3 (0.80), Store 4 (0.84), and Store 2 (0.86), and values close to 1 in Store 1 (0.98) and Store 5 (1.01). This distribution indicates moderate baseline adjustment, avoiding large increases in the demand/supply level over the decision horizon. The Peak Day Factor varies more widely and reflects the handling of demand peaks; Store 2 (1.40) exhibits the highest value, indicating the need to reinforce replenishment on high-demand days to sustain returns, even though its final ROI is the lowest among stores.
Table 7 also reports the store-level Variability Buffer. For each store, the reported parameter set is the best-performing solution returned by the GA, selected by maximizing ROI fitness in the inventory simulation using historical transactional demand data. The Variability Buffer values are not significance levels; they represent an additional safety margin that protects against demand variability. Stores 1, 3, and 5 take the minimum value of 0.05, whereas Stores 2 and 4 require larger buffers.
This is consistent with scenarios where higher uncertainty or irregular demand requires additional protection to reduce stockouts. Minimum Coverage Days support this interpretation: Store 3 has the highest coverage (9.63 days), followed by Store 4 (6.66 days), Store 1 (5.84 days), and Store 5 (5.47 days), while Store 2 shows the lowest coverage (3.00 days). The combination observed for Store 2 (high buffer and low coverage) suggests a policy that compensates variability through a parametric safety margin while maintaining a shorter coverage horizon, which may be associated with higher inventory costs or sharper demand peaks.
The Weekend Factor is around 0.50–0.51 in Stores 1, 2, and 5, whereas it is close to or above 1 in Store 3 (0.99) and Store 4 (1.08). This indicates that, for these two stores, weekend behavior does not reduce effective demand and may require relatively higher replenishment compared with weekdays. The Start-of-Month Extra and End-of-Month Extra parameters capture intra-month adjustments; end-of-month increases are relatively high in Store 3 (0.23) and Store 4 (0.25), suggesting higher replenishment toward the end of the cycle, consistent with monthly decision-making.
The Conservative Factor remains within a narrow range (0.80–0.89), indicating relatively homogeneous conservatism. Store 2 (0.89) shows the highest value, consistent with a more conservative strategy focused on operational risk control, whereas Stores 1 (0.80) and 5 (0.82) show lower values, consistent with slightly more aggressive strategies associated with higher ROI. Overall, these results indicate that (i) optimization yields comparable returns across stores with limited ROI variability, and (ii) cross-store differences are primarily driven by parameters related to volatility, minimum coverage, and time-based adjustments (weekend and intra-month), rather than by large baseline shifts.
Figure 7 shows the relationship between actual monthly demand and the monthly order quantity obtained from the GA-optimized parameters for the five stores. The dashed line represents the ideal reference (order quantity equal to demand), which is used to assess the alignment between the optimized order and the observed demand. The points are clustered near this line over the analyzed demand range, indicating consistent calibration of the aggregated monthly order quantity.
Dispersion around the ideal line remains bounded and reflects both over-ordering (points above the line) and under-ordering (points below the line). These deviations are associated with the safety mechanisms encoded in the policy genes, including base demand factors, the variability buffer, minimum coverage, and temporal adjustments. Overall, the optimized scheme reproduces the scale of monthly demand and shows stable alignment across stores, supporting the use of store-specific parameter sets to generate monthly orders during the training period.
3.5. Business Impact vs. Baseline (MA28)
Table 8 reports store-level ROI over the test period (five months) for the RF–GA methodology and the MA28 baseline. RF–GA ROI ranges from 38.76% (Store 3) to 41.43% (Store 5), whereas MA28 ROI ranges from 33.39% (Store 3) to 40.11% (Store 5). The ROI gain, measured as the percentage-point difference (RF–GA ROI minus MA28 ROI), is positive for all five stores and ranges from 1.32 p.p. (Store 5) to 6.80 p.p. (Store 4), with an average improvement of 4.52 p.p.
Store 4 shows the largest gain (40.41% vs. 33.61%), followed by Store 1 (40.70% vs. 34.81%) and Store 3 (38.76% vs. 33.39%). Store 5 exhibits the smallest gain because the baseline already achieves a high ROI (40.11%), leaving a limited incremental margin attributable to optimization. Overall, these results indicate consistent superiority of RF–GA over MA28 in terms of ROI during the test period.
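The percentage-point gains quoted above follow directly from the ROI pairs given in the text. The short check below uses only the four stores whose pairs are quoted in this section; the relative gain is added only to show how it differs from a percentage-point difference.

```python
# (RF-GA ROI %, MA28 ROI %) as quoted in the text for four of the five stores.
roi = {
    "Store 1": (40.70, 34.81),
    "Store 3": (38.76, 33.39),
    "Store 4": (40.41, 33.61),
    "Store 5": (41.43, 40.11),
}

# Percentage-point gain: simple difference of the two ROI percentages.
gain_pp = {store: round(rf - ma, 2) for store, (rf, ma) in roi.items()}

# Relative gain: the same difference expressed as a share of the baseline ROI.
gain_rel = {store: round((rf - ma) / ma * 100, 1) for store, (rf, ma) in roi.items()}

for store in roi:
    print(f"{store}: {gain_pp[store]:.2f} p.p. ({gain_rel[store]:.1f}% relative)")
```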
Figure 8 shows the monthly evolution of ROI, fill rate, and the number of stockouts for two approaches, RF–GA and the MA28 baseline, aggregating performance across the five stores over the test period of five months. RF–GA ROI remains above MA28 in all evaluated months. The monthly average ROI is 40.11% for RF–GA and 35.59% for MA28, corresponding to a mean improvement of +4.51 percentage points. Monthly differences range from +1.40 to +6.31 p.p., with larger gaps observed toward the end of the period.
For fill rate, RF–GA achieves higher values in every month. The monthly average fill rate is 98.37% for RF–GA and 93.27% for MA28, corresponding to +5.10 p.p. The largest decline for MA28 occurs in December 2023 (2023-12), while RF–GA maintains a high service level. Operationally, stockout counts are lower with RF–GA in all months. Over the full period, RF–GA totals 20 stockouts compared with 59 for MA28, corresponding to 39 fewer events. This reduction is consistent with the higher fill rate and the superior ROI observed for RF–GA throughout the test horizon.
3.6. Statistical Significance and Robustness
Robustness of the observed improvement was assessed by comparing monthly ROI from RF–GA against the MA28 baseline using the paired Wilcoxon signed-rank test on store–month observations. The difference was statistically significant, with a large effect size. The 95% confidence intervals indicate a mean ROI of 39.44–40.77% for RF–GA and 33.91–37.27% for MA28, with a mean difference of 3.07–5.96 percentage points. This is consistent with an average improvement of +4.51 percentage points and indicates that RF–GA superiority is unlikely to be explained by random variation.
To evaluate robustness under operational variations, two sensitivity analyses were conducted over the test period while keeping the optimized policies fixed: (i) variation of the holding cost by ±50% relative to the base value, and (ii) multiplicative shocks to observed demand ranging from 0.7× to 1.3×.
Table 9 shows that ROI decreases gradually as holding cost increases for both RF–GA and MA28. Over the evaluated range (factor 0.50 to 1.50), RF–GA ROI decreases from 41.20% to 40.17%, whereas MA28 ROI decreases from 39.64% to 38.63%, with the gap remaining nearly constant in favor of RF–GA. The ROI improvement remains stable at approximately 1.54–1.56 percentage points, indicating that the relative advantage of RF–GA does not depend on the holding-cost level within the ±50% interval and that performance is consistent under reasonable changes in inventory carrying costs.
Table 10 quantifies performance under demand shocks ranging from 0.7× to 1.3×. In this sensitivity analysis, demand shocks were implemented as multiplicative factors applied to the observed daily demand. A negative shock corresponds to a factor below 1, which reduces demand relative to the baseline, while a positive shock corresponds to a factor above 1, which increases demand relative to the baseline. Under low-demand scenarios (0.7× and 0.85×), both methods achieve a fill rate of 1.00, and ROI decreases due to over-ordering; in these cases, MA28 shows a marginal ROI advantage. Under the baseline scenario (1.0×), RF–GA achieves higher ROI (0.41 vs. 0.39) and maintains a higher fill rate (0.98 vs. 0.96), with fewer stockouts (1 vs. 2).
Under positive shocks (1.15× and 1.3×), RF–GA retains higher ROI and fill rate, while stockout counts increase for both methods, reflecting higher inventory pressure. Overall, the crossover occurs between negative and positive shocks: RF–GA is competitive under normal conditions and shows an advantage when demand exceeds the baseline, whereas MA28 performs slightly better when demand decreases and excess-inventory costs dominate returns.
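The mechanics of the demand-shock analysis can be sketched with a toy simulation in which the ordering policy stays fixed while demand is scaled. Everything below is an illustrative assumption: the fixed daily order, the prices and costs, the carry-over of unsold stock, and the ROI definition (profit divided by total cost) are placeholders and do not reproduce the paper's simulator or the values in Table 10.

```python
def simulate(demand, order_qty, price=10.0, unit_cost=6.0, holding_cost=0.05):
    """Toy inventory run: fixed daily order, unsold stock carried to the next day."""
    stock = 0.0
    served = total = cost = 0.0
    stockouts = 0
    for d in demand:
        stock += order_qty
        cost += order_qty * unit_cost
        sold = min(stock, d)
        if sold < d:
            stockouts += 1          # a day with unmet demand counts as one stockout
        served += sold
        total += d
        stock -= sold
        cost += stock * holding_cost  # holding cost on end-of-day inventory
    revenue = served * price
    return {
        "roi": (revenue - cost) / cost,   # assumed ROI definition
        "fill_rate": served / total,
        "stockouts": stockouts,
    }

# Synthetic four-week demand with a weekly cycle (illustrative only).
base_demand = [100, 95, 90, 105, 130, 160, 150] * 4

# The policy (order quantity) is held fixed while demand is shocked.
results = {}
for factor in (0.7, 0.85, 1.0, 1.15, 1.3):
    shocked = [d * factor for d in base_demand]
    results[factor] = simulate(shocked, order_qty=120)

for factor, res in results.items():
    print(f"{factor:.2f}x: ROI={res['roi']:.3f} "
          f"fill={res['fill_rate']:.3f} stockouts={res['stockouts']}")
```

Even in this toy setting, negative shocks leave the fill rate at 1.00 while ROI falls because of excess inventory, and positive shocks raise stockout counts, which is the qualitative pattern described for Table 10.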
To verify that the ROI improvement is not explained by random variation, the paired Wilcoxon signed-rank test was applied to monthly ROI from both methods (store–month pairs). The results confirm the superiority of RF–GA over MA28 with statistical significance and a large effect size. The 95% confidence intervals indicate a mean ROI of 39.44–40.77% for RF–GA and 33.91–37.27% for MA28, with a mean difference of 3.07–5.96 percentage points. Overall, this supports that the observed average gain (+4.51 p.p.) is robust over the evaluated period.
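For readers unfamiliar with the test, the following sketch shows how a paired Wilcoxon signed-rank statistic is computed. The effect-size measure used in the paper is not specified in this section, so the matched-pairs rank-biserial correlation is shown as one common choice; the p-value uses a normal approximation without tie or continuity correction, and the ROI pairs are synthetic placeholders rather than the study's store–month data.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank sketch with a rank-biserial effect size."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Group tied absolute differences and assign them the average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    n = len(diffs)
    # Normal approximation (no tie or continuity correction; rough for small n).
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    return {
        "w_plus": w_plus,
        "w_minus": w_minus,
        "p_approx": math.erfc(abs(z) / math.sqrt(2)),  # two-sided
        "rank_biserial": (w_plus - w_minus) / (w_plus + w_minus),
    }

# Synthetic monthly ROI pairs (%), for illustration only.
rf_ga = [40.7, 39.9, 41.2, 40.1, 38.8, 40.5]
ma28 = [34.8, 36.2, 39.8, 35.0, 33.4, 40.9]

result = wilcoxon_signed_rank(rf_ga, ma28)
print(result)
```

In practice an established routine such as `scipy.stats.wilcoxon` would normally be preferred, since it handles exact p-values and tie corrections.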