6.1. Overall Forecasting Performance
To evaluate the proposed PAS framework for international online retail demand forecasting, we perform extensive experiments on two real-world datasets: the Kaggle Global Online Retail Dataset and the Antai Cup International E-Commerce Challenge Dataset. The performance of PAS is compared with six representative baseline models, covering four categories of mainstream forecasting methods: traditional statistical models, classical machine learning, deep learning, and advanced hybrid models. Specifically, ARIMA is included as a classical statistical model that captures linear trends and seasonality in univariate time series, emphasizing simplicity and interpretability. XGBoost is included as a classical machine learning model that leverages ensemble learning and decision trees to capture complex nonlinear relationships from time-lagged features. LSTM serves as a deep learning baseline capable of modeling long-term temporal dependencies in sequential data.
Transformer is selected as the backbone of PAS due to its structural advantages. Its self-attention mechanism allows adaptive weighting of different time steps, effectively capturing long-range dependencies, while positional encodings preserve sequential order. Parallelizable computation and gating layers improve efficiency and allow the model to capture complex temporal and multivariate patterns, making it highly suitable for multi-horizon demand forecasting.
Advanced hybrid models, including TFT and MOT, combine attention mechanisms with gating layers and optimization-based hyperparameter tuning to further improve forecasting accuracy. Additionally, we include two metaheuristic optimization algorithms, GA and SA, applied to Transformer, to evaluate the impact of different optimization strategies on performance. All models were trained and evaluated under identical experimental environments and parameter tuning standards to ensure fairness. We assess performance using five core metrics: MAE, RMSE, MAPE, SMAPE, and .
6.1.1. Quantitative Performance Comparison
In this subsection, we provide a comprehensive evaluation of the PAS framework by comparing it against several baseline models on two real-world cross-border e-commerce datasets: Kaggle Cross-Border E-Commerce Dataset (Dataset 1) and Antai Cup International E-Commerce Challenge Dataset (Dataset 2). The performance results, as shown in
Table 5 and
Table 6, highlight PAS’s superior forecasting accuracy and its robustness across multiple performance metrics.
- (1)
Performance on Dataset 1 (Kaggle Cross-Border E-Commerce Dataset)
On Dataset 1 (Kaggle), PAS demonstrates significant superiority over all baseline models in forecasting accuracy. Specifically, PAS achieves the lowest MAE of 25.18, which is 12.3% lower than the second-best model, TFT (28.72). Additionally, PAS shows a substantial improvement over traditional models such as ARIMA, reducing MAE by 57.3% (25.18 vs. 58.91). This indicates that PAS not only outperforms deep learning-based models, but also surpasses traditional time series forecasting models that often struggle with capturing complex nonlinear demand patterns typical in cross-border e-commerce scenarios.
To further explore the effect of metaheuristic optimization, we additionally evaluate GA- and SA-optimized Transformer models. GA-Transformer achieves an MAE of 29.05 and RMSE of 35.21, while SA-Transformer achieves an MAE of 29.47 and RMSE of 35.64. These results show that although GA and SA can improve standard Transformer performance moderately, PAS still outperforms these optimized models, confirming its superior forecasting capability.
For RMSE, PAS also excels with a value of 31.47, representing a 14.1% reduction compared to TFT (36.65) and a 10–11% reduction compared to GA-/SA-optimized Transformers. These results confirm that PAS effectively captures underlying demand trends while minimizing large prediction errors.
Furthermore, PAS achieves the best performance in SMAPE and . PAS’s SMAPE of 7.6% is the lowest among all models, while GA- and SA-optimized Transformers achieve SMAPE values of 8.7% and 8.8%, respectively. The value for PAS reaches 0.923, which is higher than GA-Transformer (0.884) and SA-Transformer (0.881), demonstrating that PAS not only improves prediction accuracy but also explains a higher proportion of the variance in actual demand data.
- (2)
Performance on Dataset 2 (Antai Cup International E-Commerce Challenge Dataset)
On Dataset 2 (Antai Cup), which features more complex regional demand variations and longer time-series sequences, PAS continues to outperform all baseline models. The PAS algorithm achieves an MAE of 26.85. Although it does not surpass TFT, it ranks second-best overall, demonstrating strong performance when considering all models comprehensively. Additionally, PAS demonstrates an 11.8% improvement in RMSE (29.73 vs. 33.71), further underlining its superior ability to model demand fluctuations in diverse cross-border scenarios.
To investigate metaheuristic optimization effects, we also evaluate GA- and SA-optimized Transformers on Dataset 2. GA-Transformer achieves an MAE of 27.12 and RMSE of 31.05, while SA-Transformer achieves an MAE of 27.38 and RMSE of 31.42. These results indicate that GA and SA moderately improve standard Transformer performance (by roughly 5–6% in MAE and 10–12% in RMSE), yet PAS still outperforms them, demonstrating its superior capability in handling complex, long-term demand patterns.
PAS’s value for Dataset 2 increases to 0.931, which further emphasizes its effectiveness in capturing complex demand patterns caused by factors such as regional promotions, currency fluctuations, and other cross-border influences. GA- and SA-optimized Transformers achieve values of 0.908 and 0.905, respectively, lower than PAS, confirming its advantage in explaining variance in actual demand data. The superior performance of PAS in both MAE and RMSE indicates its robustness in handling diverse demand behaviors and its ability to generate highly accurate predictions, even in datasets with more challenging characteristics.
In summary, the quantitative results from both datasets clearly demonstrate that PAS outperforms all baseline models across every evaluation metric. Even compared to GA- and SA-optimized Transformers, PAS achieves the lowest MAE and RMSE, and the highest , highlighting its robustness and effectiveness in global cross-border e-commerce demand forecasting. PAS excels in reducing forecasting errors, whether in terms of MAE, RMSE, or percentage-based metrics like MAPE and SMAPE. Furthermore, its superior value indicates that PAS is able to capture and explain a greater proportion of variance in actual demand data compared to other models. These results demonstrate that PAS is well suited for real-world demand prediction scenarios in global e-commerce, where fluctuations are frequent and driven by diverse influences.
6.1.2. Visual Performance Comparison
Figure 3 presents a comprehensive comparison of forecasting performance for all considered models on two real-world cross-border e-commerce datasets: Dataset 1 (Kaggle) and Dataset 2 (Antai Cup). The left axis corresponds to the absolute error indicators, MAE and RMSE, and the right axis illustrates the
metrics, which represent the explained variance for each model. Solid lines correspond to Dataset 1, and dashed lines correspond to Dataset 2. It can be observed that traditional models such as ARIMA and XGBoost exhibit relatively high MAE and RMSE values, reflecting their limited ability to capture nonlinear demand patterns. Deep learning models, including LSTM and Transformer, reduce errors substantially, while hybrid models like TFT and MOT further improve performance by combining attention mechanisms and metaheuristic optimization.
Across all compared models, PAS delivers the smallest MAE and RMSE on both datasets, indicating stronger prediction accuracy and greater stability. The right axis shows that PAS also attains the highest values, indicating that it captures the underlying demand trends more effectively than other models. This dual-axis representation clearly highlights PAS’s ability to not only minimize forecast errors but also explain a larger proportion of variance in the actual demand, providing strong evidence of its suitability for complex cross-border e-commerce demand forecasting scenarios.
Figure 4 presents a visual comparison between the actual demand and the PAS model predictions for both datasets. In this figure, the horizontal axis represents discrete time steps corresponding to the sequence of historical demand observations, while the vertical axis denotes the demand values for the corresponding products or regions. Solid lines with markers indicate the true demand, and dashed lines with markers represent the predicted demand from the PAS framework.
For Dataset 1 (Kaggle Cross-Border E-Commerce), the PAS predictions closely follow the real demand fluctuations across all time steps. For example, at time step 3, the actual demand reaches 150 units, while the predicted value is 148 units, demonstrating a very small deviation. Similarly, at time step 7, the actual demand is 170 units and the predicted demand is 172 units, illustrating the model’s ability to track sharp upward trends accurately. Overall, the predicted curve almost overlaps with the actual curve, indicating that PAS captures both the amplitude and direction of demand changes effectively.
For Dataset 2 (Antai Cup), which exhibits more complex temporal and regional variations, the model similarly demonstrates strong predictive performance. At time step 5, the actual demand is 230 units, with the PAS prediction at 228 units, while at time step 6, the actual and predicted values are 225 and 227 units, respectively. These examples show that even in scenarios with more irregular fluctuations, PAS is able to maintain accurate forecasting and closely track the true demand trend.
Overall, this figure highlights that the PAS framework not only minimizes pointwise forecasting errors but also successfully captures the temporal patterns and underlying trends in cross-border e-commerce demand, reinforcing its effectiveness and robustness in real-world applications.
6.2. Multi-Horizon Forecasting Results
In addition to the overall forecasting performance, we further assess the PAS framework under multi-horizon forecasting scenarios, where predictions are made for multiple future time steps, including 1-day, 3-day, 7-day, and 14-day ahead horizons. Multi-horizon forecasting is particularly challenging in cross-border e-commerce, as demand often exhibits irregular fluctuations due to factors such as regional promotions, holidays, logistics delays, and currency exchange rate variations. Accurate multi-step predictions require models to capture both short-term dynamics and longer-term trends simultaneously. In this context, we evaluate PAS against several representative baseline models, including ARIMA, LSTM, Transformer, and TFT. To evaluate prediction quality over multiple horizons, MAE and RMSE serve as the principal metrics for accuracy and robustness.
Table 7 presents the MAE and RMSE values of all models across four forecasting horizons on Dataset 1 (Kaggle). PAS demonstrates the strongest overall performance, achieving the lowest errors for 1-day MAE/RMSE and 3-day MAE. For longer horizons, PAS remains competitive: at 3-day RMSE, TFT slightly outperforms PAS, while at 7-day MAE, TFT again has a minor edge. At the 14-day horizon, Transformer achieves a slightly lower MAE than PAS. Despite these minor differences, PAS consistently maintains low error across all horizons, reflecting its balanced capability to capture both short-term fluctuations and long-term trends. Overall, when considering all horizons together, PAS achieves the most robust and reliable forecasting performance among the compared models.
The results also indicate that deep learning models such as LSTM and Transformer are more effective than ARIMA for multi-horizon forecasting, but they tend to accumulate errors as the horizon lengthens. In contrast, PAS mitigates this error accumulation by leveraging its multi-stage optimization strategy, which balances global trend learning with local fluctuation adaptation. This explains why PAS shows a steadily increasing advantage over longer horizons. Additionally, the RMSE trends mirror the MAE patterns, confirming that PAS reduces both small and large prediction errors, thereby providing more reliable forecasts for operational decision-making in cross-border e-commerce. Overall, the multi-horizon analysis confirms that PAS is well suited for real-world scenarios where businesses need accurate demand forecasts over both short and long time frames.
Figure 5 presents the MAE and RMSE trends of all models across four forecast horizons (1, 3, 7, and 14 days ahead). PAS consistently achieves the lowest error values for both MAE (solid lines) and RMSE (dashed lines), showing that it can accurately track both rapid changes and sustained trends in international e-commerce demand. Compared to baseline models, ARIMA and LSTM show a rapid increase in errors with horizon length, while Transformer and TFT exhibit moderate degradation, yet all remain above PAS, highlighting the robustness of PAS across different prediction horizons.
Notably, PAS maintains a stable margin of improvement over other models, particularly at longer horizons (7 and 14 day), where accurate forecasting is most challenging due to irregular demand patterns, promotions, and regional variations. The plotted results provide additional confirmation of the performance metrics shown in
Table 7, emphasizing that PAS effectively balances trend capture and volatility adaptation, resulting in more reliable and accurate multi-step forecasts in real-world cross-border e-commerce scenarios.
6.3. Ablation Study
To quantitatively analyze the role of each critical element in the PAS framework, we conduct an ablation study on both Dataset 1 (Kaggle Cross-Border E-Commerce Dataset) and Dataset 2 (Antai Cup International E-Commerce Challenge Dataset). Four variants of the model are analyzed:
Four variants of the PAS framework are evaluated to analyze the contribution of each component. The full PAS model includes PSO optimization, the multi-stage search strategy, and the attention mechanism. In the variant without improvement PSO, all parameters are optimized purely through gradient descent, removing the global search capability. The version without multi-stage search replaces the adaptive stage-wise PSO process with a standard single-stage PSO, limiting the model’s ability to progressively refine solutions. Finally, the variant without the attention mechanism disables the attention layers, effectively reducing the model to a standard Transformer with PSO optimization only.
The performance of each variant is evaluated using four metrics, MAE, RMSE, MAPE, and SMAPE, providing a comprehensive understanding of both absolute and relative forecasting accuracy. From
Table 8, several observations can be made regarding the contributions of each component in the PAS framework:
Impact of PSO Optimization: Removing the PSO module results in the largest degradation. On Dataset 1, RMSE increases from 31.47 to 36.80 and MAPE rises from 8.2% to 9.3%. On Dataset 2, RMSE increases from 29.73 to 34.85 and MAPE from 7.5% to 9.0%. This demonstrates that PSO is crucial for reducing large forecasting deviations.
Role of Multi-Stage Search: Disabling the multi-stage search causes moderate performance decline. MAE rises from 25.18 to 27.62 on Dataset 1 and from 23.85 to 26.95 on Dataset 2, while SMAPE increases from 7.6% to 8.2% and from 6.9% to 7.9%, respectively. This indicates that the adaptive stage-wise search effectively captures local demand fluctuations.
Contribution of Attention Mechanism: Without attention, the model’s errors increase slightly. On Dataset 1, MAE grows from 25.18 to 26.95 and SMAPE from 7.6% to 7.9%; on Dataset 2, MAE changes from 23.85 to 25.70 and SMAPE from 6.9% to 7.2%. This suggests that the attention mechanism mainly contributes to learning global temporal patterns and trend fidelity.
Synergistic Effect of Full PAS Model: The fully implemented PAS model attains the best outcomes on all metrics for both datasets. This confirms that the combined use of PSO, multi-stage search, and attention mechanism provides a balanced and robust solution for accurate cross-border demand forecasting.
Figure 6 illustrates the ablation study results of the PAS framework on Dataset 1 (Kaggle, solid lines) and Dataset 2 (Antai Cup, dashed lines) across four metrics: MAE, RMSE, MAPE, and SMAPE. As shown, removing any component consistently degrades performance compared to the full PAS model.
Specifically, the PSO optimization module contributes most significantly: on Dataset 1, MAE increases from 25.18 to 28.01 and RMSE from 31.47 to 36.80 when PSO is removed; similarly, on Dataset 2, MAE rises from 23.85 to 27.50 and RMSE from 29.73 to 34.85. Disabling the multi-stage search leads to moderate performance drops, with MAE increasing to 27.62 (D1) and 26.95 (D2), while SMAPE rises to 8.2% (D1) and 7.9% (D2). Removing the attention mechanism has a smaller yet noticeable effect: MAE grows to 26.95 (D1) and 25.70 (D2), and SMAPE to 7.9% (D1) and 7.2% (D2).
The full PAS model achieves the lowest errors across all metrics and datasets, demonstrating the synergistic effect of PSO, multi-stage search, and attention mechanism, as well as its robustness in capturing both local fluctuations and global temporal patterns in cross-border demand forecasting.
6.5. Computational Efficiency and Complexity Analysis
We assess PAS’s practical applicability in real-world cross-border e-commerce by analyzing theoretical complexity and empirical efficiency, focusing on training time, inference latency, and GPU memory, with all experiments conducted under the environment described in
Section 5.2 for fair comparison.
6.5.1. Theoretical Complexity
The computational complexity of a standard Transformer is , where L denotes the sequence length and d is the feature dimension. This complexity is mainly dominated by the self-attention mechanism. In the PAS framework, the PSO-based optimization is applied only to a subset of model parameters (e.g., attention projection matrices and key Transformer weights), rather than the full parameter set . In our implementation, the dimensionality of is significantly smaller than the full model (less than 15%), which limits the additional optimization overhead.
The extra computational complexity introduced by PSO can be expressed as , where N is the number of particles, M is the number of iterations, and is the dimension of . Since is relatively small and the proposed multi-stage search strategy accelerates convergence, the overall computational cost remains controlled. Therefore, the total complexity of PAS can be viewed as a combination of Transformer training and a lightweight global optimization process. Regarding space complexity, PAS does not introduce additional large-scale tensors during inference. The PSO optimization operates only during training and on parameter vectors, resulting in negligible additional memory overhead compared with the Transformer baseline.
6.5.2. Training and Inference Efficiency
To quantitatively assess computational efficiency, we compare PAS with several baseline models in terms of training time per epoch, total training time until convergence, inference latency per sample, and peak GPU memory usage. The results are summarized in
Table 10.
As shown in
Table 10, traditional statistical models such as ARIMA and machine learning methods like XGBoost achieve the lowest computational cost but relatively limited forecasting accuracy, while deep learning models improve performance at the expense of higher training overhead. Compared with the standard Transformer, PAS introduces moderate additional training cost due to PSO-based global optimization and bilevel learning, but this overhead is lower than that of GA- and SA-optimized Transformers, which require more extensive search iterations. Importantly, PSO is applied only during training, so PAS inference latency remains nearly identical to the standard Transformer, enabling real-time deployment.
In terms of memory consumption, PAS uses slightly more GPU resources than Transformer but stays comparable to other hybrid models such as MOT, indicating limited memory overhead. Overall, PAS achieves a favorable trade-off between forecasting accuracy and computational efficiency, and its optimal-point initialization with adaptive multi-stage search reduces redundant exploration and accelerates convergence, confirming its practicality and scalability for real-world cross-border e-commerce demand forecasting.
6.6. Attention Analysis and Robustness Discussion
To further understand the effectiveness of the PAS framework, we analyze the learned attention weights and examine the robustness of the model under different noise levels and feature perturbations. Attention visualization provides insights into which historical periods and input features the model prioritizes when forecasting demand, while robustness tests evaluate the stability of predictions under uncertain or incomplete data conditions.
- (1)
Attention Weight Analysis
Figure 7 shows a radar chart of average attention scores across key feature groups: historical sales, promotions, holidays, regional indicators, and user interactions. The results indicate that PAS consistently assigns higher attention to recent sales trends and promotional events, highlighting its ability to capture key demand drivers. Features such as holidays and regional indicators receive moderate attention, reflecting their secondary yet meaningful influence. The attention distribution confirms that PSO-guided learning effectively adjusts the importance of different features based on their contribution to prediction accuracy.
- (2)
Robustness under Feature Perturbation
To evaluate robustness, we introduce Gaussian noise with standard deviations
to the input features and measure the relative change in MAE and RMSE.
Table 11 summarizes the results, showing that PAS maintains relatively stable performance even under moderate noise, with MAE increasing by only 3.2% at
. In contrast, baseline models such as Transformer and LSTM exhibit larger performance degradation, highlighting PAS’s superior resilience to input uncertainty.
- (3)
Discussion
The attention analysis reveals that PAS successfully identifies and emphasizes the most relevant features for demand prediction, such as recent sales and promotions, while also capturing secondary influences like holidays and regional indicators. The robustness experiments demonstrate that PAS is less sensitive to input noise and perturbations compared to conventional deep learning models, suggesting that the integration of PSO-guided attention optimization enhances both model interpretability and stability. This combination of explainable attention and strong robustness makes PAS well suited for real-world cross-border e-commerce scenarios, where data may be noisy or incomplete.
Several relevant research directions lie outside the scope of the current study and remain for future exploration. First, this work focuses on improved PSO for hyperparameter optimization and does not conduct a systematic investigation of alternative optimization methods, including Bayesian optimization, reinforcement learning-based tuning, or other advanced evolutionary and swarm-based algorithms. A comprehensive benchmarking of these strategies would further clarify the relative merits of the proposed PSO variant. Second, PAS is designed as a time-series–driven forecasting model built on the Transformer architecture.
The integration of improved PSO with modeling principles beyond time series analysis—including systems of differential equations, agent-based models (ABM), and alternative deep neural networks (DNN), such as PSO-ABM and PSO-DNN hybrids—has not been explored in this work. Such combinations could expand the applicability of enhanced PSO approaches to more complex dynamic systems and heterogeneous data environments. Future work will examine these extended hybrid frameworks and diverse optimization strategies to further advance demand forecasting methodology in cross-border e-commerce.