Article

PAS: A Novel Attention-Enhanced Particle Swarm Optimization Model for Demand Forecasting in Cross-Border E-Commerce

1 School of International Education, Zhejiang Polytechnic University of Mechanical and Electrical Engineering, Hangzhou 310053, China
2 School of Economics and Management, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3386; https://doi.org/10.3390/app16073386
Submission received: 24 February 2026 / Revised: 21 March 2026 / Accepted: 30 March 2026 / Published: 31 March 2026

Abstract

Demand forecasting is crucial for optimizing cross-border e-commerce operations, yet traditional methods often struggle to capture complex input–output relationships and nonlinear patterns. This paper proposes an enhanced model, Particle Swarm Optimization with Attention and Strategy (PAS), to address the low search accuracy and slow convergence of conventional PSO. An optimal-point set strategy is introduced to improve population initialization and global search efficiency, enabling more effective global and local exploration. Moreover, an improved Transformer model is adapted for demand forecasting by separately modeling input and output features and fusing them through the decoder, allowing the model to better capture complex relationships between e-commerce variables. A multi-stage search and learning mechanism is further designed, in which PSO first explores the global demand space, followed by localized learning using attention mechanisms. This staged process accelerates convergence and reduces the risk of falling into local optima. Furthermore, we conducted comparative experiments pitting the proposed PSO algorithm against two classical optimization algorithms, the genetic algorithm (GA) and simulated annealing (SA), to demonstrate the rationality of the proposed method. Evaluation on real-world datasets shows that the proposed model markedly surpasses conventional approaches, achieving an average MAPE of 8.7%, which is 23% lower than the Transformer model and 30% lower than the LSTM model. These results underscore the reliability and stability of the proposed approach for demand forecasting in e-commerce.

1. Introduction

Demand forecasting is a cornerstone of operational optimization in cross-border e-commerce, crucial for guiding pricing strategies, inventory control, and enhancing overall supply chain efficiency [1]. As e-commerce continues to expand globally, the ability to accurately predict consumer demand has become increasingly complex. Various factors, including seasonal fluctuations, regional preferences, marketing activities, and political and economic shifts across borders, contribute to the unpredictability of demand [2]. These complexities are compounded by the diverse nature of global markets, where consumer behavior can vary significantly across regions and where local market conditions can rapidly change in response to international events [3]. Consequently, businesses in cross-border e-commerce face significant challenges in managing their inventories, optimizing pricing strategies, and ensuring supply chain fluidity. Effective demand forecasting is therefore crucial for minimizing overstocking or stockouts and ensuring the smooth flow of products from suppliers to consumers [4].
Conventional approaches to demand forecasting, including time series analysis [5], linear regression [6], and moving averages [7], have been widely used to model demand patterns over time. Although effective in some applications, these methods often fail to capture the intricate, nonlinear interactions among factors such as website traffic, promotional activities, and seasonal patterns, and their impact on resulting demand [8]. For instance, time series models can capture basic trends but fail to incorporate sudden shifts in consumer behavior, market conditions, or external events. Additionally, these methods typically assume linear relationships and stationary data, making them ill-suited for handling the dynamic and volatile nature of cross-border e-commerce environments [9]. As a result, these traditional approaches often lead to inaccurate forecasts, affecting decision-making processes and ultimately impacting the profitability and operational efficiency of e-commerce platforms [10]. This has prompted a move toward advanced forecasting models for analyzing global demand data.
Recent studies have applied machine learning and neural network methods to improve demand forecasting in e-commerce. These approaches are increasingly used to overcome traditional limitations and to model the nonlinear relationships between input features and demand outcomes [11]. For example, Tian et al. [12] proposed a BP-LSTM-Verhulst nested neural network model to improve sales prediction accuracy in cross-border e-commerce. Pan et al. [13] proposed a CNN-based method to predict commodity sales by extracting features from e-commerce log data.
Moreover, various evolutionary optimization algorithms have been employed for hyperparameter optimization in forecasting models, including PSO, GA, SA, and advanced hybrid algorithms such as Hybrid Multi-Swarm Particle Swarm Optimization (HMSPSO), Multi-Agent Hybrid Chaotic Adaptive Genetic Algorithm (MA-HCAGA), and Real Coded Genetic Algorithm Particle Swarm Optimization (RCGA-PSO) [14,15]. PSO is widely used for tuning neural networks and LSTM parameters, GA leverages principles of biological evolution for global search, and SA probabilistically escapes local optima. Hybrid algorithms, including HMSPSO and RCGA-PSO, combine the strengths of multiple evolutionary strategies to balance global exploration and local exploitation, improving convergence speed and forecasting accuracy [16]. These approaches highlight the growing importance of evolutionary computation in enhancing model performance for complex cross-border e-commerce demand scenarios.
Recent research has also explored hybrid models and optimization techniques to address challenges in demand forecasting for cross-border e-commerce. Yang et al. [17] introduced a deep learning model based on LSTM, VMD, and PSO for forecasting user purchase behavior. Their model improves prediction accuracy and robustness through enhanced data quality and automated hyperparameter tuning. Similarly, He et al. [18] proposed a multi-objective PSO approach to fine-tune an echo state network, improving both the accuracy and reliability of wind speed predictions. In e-commerce demand forecasting, Ni et al. [19] developed a Stacking ensemble model that combines XGBoost, LightGBM, CatBoost, and SARIMA, outperforming individual models and providing valuable insights for inventory management. Furthermore, Niu et al. [20] examined how strategic disruption forecasting can improve channel coordination and pricing leverage, ultimately enhancing overall supply chain efficiency.
Despite these advancements, challenges remain in enhancing model precision, convergence rate, and adaptability for global e-commerce demand forecasting. In response, we propose a PSO-enhanced framework with attention mechanisms and a multi-stage strategy to improve demand forecasting. PAS mitigates typical PSO issues, such as slow convergence and local optima, while using attention mechanisms to model complex, nonlinear relationships between inputs and demand. Additionally, by benchmarking the proposed PSO against GA, SA, and hybrid evolutionary algorithms, our framework ensures effective hyperparameter optimization, demonstrating the superiority of the proposed method for diverse cross-border demand scenarios. By integrating optimization techniques with advanced machine learning methods, we propose a more robust and adaptive model capable of accurately capturing both global and local demand patterns in cross-border e-commerce.
This paper makes the following key contributions:
  • Improved PSO Algorithm: Proposes an enhanced PSO algorithm to overcome the issues of slow convergence and local optima, improving the overall search efficiency and accuracy.
  • Improved Transformer Attention Mechanisms: Adapts and improves Transformer-based attention mechanisms to dynamically capture nonlinear relationships between input features and demand patterns, enhancing forecasting accuracy.
  • Enhanced Adaptability for Cross-Border E-Commerce: Develops a model that adjusts dynamically to regional and market-specific demand factors, improving forecasting precision in global and diverse markets.
  • Multi-Stage Search and Learning Mechanism: Introduces a two-phase process, combining global exploration with localized search, improving search accuracy and preventing the model from getting stuck in local optima.
  • Integration of Evolutionary Algorithms: Employs genetic algorithms and simulated annealing as benchmark optimizers for hyperparameter tuning, providing comparative evidence for the superiority of the proposed particle swarm optimization-based method.

2. Related Work

2.1. Demand Forecasting in E-Commerce

Reliable demand prediction plays a key role in optimizing e-commerce operations, especially in cross-border markets with complex sales influences [21]. Traditional methods such as time series analysis, moving averages, and linear regression have been widely applied; however, they often fail to capture complex nonlinear patterns and the influence of external factors such as marketing campaigns, promotions, and economic shifts. In cross-border e-commerce, additional challenges arise from regional preferences, currency fluctuations, and supply chain uncertainties.
To address these limitations, modern data-driven techniques have gained wide adoption for predicting demand. Techniques including SVM, Random Forests, and LSTM networks are popular for capturing nonlinear relationships in sales data. These methods outperform older approaches, particularly in managing extensive datasets and adapting to evolving market conditions [22]. However, challenges remain, particularly in predicting demand accurately across diverse regions and fluctuating global markets.

2.2. Evolutionary Optimization in Forecasting Models

PSO is an optimization method inspired by how birds move in flocks or fish swim in schools. In demand forecasting, PSO is primarily used to optimize the parameters of forecasting models, improving their accuracy and convergence. PSO’s global search capabilities make it ideal for optimizing complex, high-dimensional models like Artificial Neural Networks (ANNs), SVMs, and LSTM networks. By adjusting model parameters, PSO helps overcome local optima and enhances forecasting performance, particularly in capturing nonlinear relationships between input features and demand outcomes.
Although PSO has shown promise in demand forecasting, it faces challenges such as slow convergence and difficulties in finding the global optimum in complex search spaces [23]. Researchers have introduced various modifications to address these limitations, including hybrid approaches that combine PSO with alternative heuristic strategies. These improvements aim to enhance the global search capabilities, speed up convergence, and ensure that PSO-based models can better adapt to dynamic and complex demand forecasting scenarios.
In addition to PSO, other evolutionary optimization algorithms such as GA and SA are widely applied for hyperparameter tuning in neural network models. GA mimics the process of natural selection to explore global optima, while SA uses probabilistic transitions to escape local minima. Hybrid methods, such as GA-PSO, integrate the complementary strengths of multiple algorithms to improve both exploration and convergence. Despite these advantages, GA and SA can suffer from slow convergence and limited local search capabilities when optimizing high-dimensional Transformer parameters. Therefore, in this study, we include comparative experiments with GA and SA to demonstrate the effectiveness and efficiency of the proposed PSO-enhanced PAS framework for hyperparameter optimization.

2.3. Hybrid Models for Demand Forecasting

Hybrid models combine multiple forecasting techniques to leverage their strengths and overcome their individual limitations. In demand forecasting, hybrid approaches often integrate machine learning algorithms with traditional time series models or optimization techniques [24]. For example, ensemble methods like Stacking or Boosting combine the predictions of several base models, such as XGBoost, LightGBM, or CatBoost, to achieve more reliable and consistent results. These hybrid models are able to analyze different characteristics of the data, including trends, seasonality, and nonlinear patterns. They provide forecasts that are generally more accurate than those produced by individual models.
Moreover, hybrid models that combine optimization techniques like PSO with machine learning algorithms have gained attention. PSO can be used to fine-tune the hyperparameters of models like LSTM or SVM, enhancing their performance by finding optimal configurations that reduce overfitting and improve generalization [25]. Such hybrid approaches, particularly in cross-border e-commerce, can better handle dynamic demand patterns by integrating both global search strategies and advanced predictive models, offering a more flexible and adaptive solution to demand forecasting challenges.

3. Preliminary

3.1. PSO Algorithm

PSO is an iterative algorithm that models the collaborative behavior of individuals in a population to explore and identify optimal solutions [26]. Each particle in the $D$-dimensional space has a position $x_i$ and velocity $v_i$. Its performance is measured by $f(x_i)$, and the particle tracks both its personal best $p_i$ and the swarm's global best $g$. The particles update their positions by combining inertia, personal, and social influences, which balances global search and local refinement. The rules for updating velocity and position are defined as follows:
$$v_i^{t+1} = \omega v_i^{t} + c_1 r_1 (p_i - x_i^{t}) + c_2 r_2 (g - x_i^{t}), \qquad x_i^{t+1} = x_i^{t} + v_i^{t+1},$$
where $\omega$ is the inertia weight, $c_1$ and $c_2$ are acceleration coefficients, and $r_1, r_2 \sim U(0,1)$ introduce stochasticity. To monitor convergence, the swarm's best fitness at iteration $t$ can be defined as follows:
$$f_{\text{best}}^{t} = \min_i f(x_i^{t}),$$
which provides a quantitative criterion to terminate the search once improvement falls below a threshold $\epsilon$.
While PSO is simple and effective, it may converge prematurely in complex or high-dimensional spaces. Strategies such as dynamic inertia weights, velocity clamping, and optimized population initialization [27] improve global search efficiency. In e-commerce demand forecasting, PSO can optimize model parameters or feature selection by minimizing prediction error. Moreover, integrating PSO with models capable of capturing sequential dependencies, such as Transformers, allows simultaneous global exploration of the parameter space and local exploitation of temporal demand patterns, providing a robust foundation for hybrid forecasting methods.
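As a concrete reference, the update rules above can be sketched as a minimal PSO loop. This is a self-contained illustration with assumed parameter values (inertia 0.7, acceleration 1.5, bounds, seed), not the improved variant developed later in the paper:

```python
import numpy as np

def pso_minimize(f, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize f using the canonical velocity/position update rules."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions x_i
    v = np.zeros_like(x)                          # velocities v_i
    p = x.copy()                                  # personal bests p_i
    p_val = np.apply_along_axis(f, 1, x)
    g = p[p_val.argmin()].copy()                  # global best g
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        x += v
        val = np.apply_along_axis(f, 1, x)
        improved = val < p_val
        p[improved], p_val[improved] = x[improved], val[improved]
        g = p[p_val.argmin()].copy()              # track swarm-best fitness
    return g, float(p_val.min())
```

On a simple sphere function the swarm reliably contracts toward the origin, illustrating how the inertia, cognitive, and social terms balance exploration and refinement.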

3.2. Transformer Model

The Transformer model [28] captures long-range dependencies in sequential data using self-attention, avoiding recurrent computations. Given an input sequence $X$, the scaled dot-product attention is
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
where $Q$, $K$, $V$ are query, key, and value matrices, and $d_k$ is the key dimension. This allows each position to attend to all others, capturing global dependencies. Multi-head attention enhances representation by performing attention in $h$ subspaces and concatenating the results:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$$
with $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ as learned projections. In demand forecasting, Transformers model temporal dependencies of features like historical sales and promotions, mapping input sequences $X$ to outputs $Y = [y_1, \ldots, y_T]$. The attention mechanism also dynamically weighs feature importance, making it suitable for hybrid models combined with optimization algorithms such as PSO.
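The scaled dot-product attention above can be sketched directly in NumPy. This is a minimal illustration of the formula; production Transformer implementations add batching, masking, and learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each row of `weights` sums to one, so the output at every position is a convex combination of the value vectors, which is exactly how attention lets each position look at all others.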

3.3. Time Series and Demand Forecasting Basics

Time series forecasting seeks to estimate future values using past data trends. In e-commerce, the objective is to estimate future demand $y_t$ using past demand and auxiliary features $X_t$. A generalized nonlinear formulation can be expressed as follows:
$$y_t = f\!\left(y_{t-1}, \ldots, y_{t-p};\ \sum_{k=1}^{m} w_k\, \phi_k(X_t)\right) + \epsilon_t,$$
where $p$ is the lag order, $\phi_k(\cdot)$ are potentially nonlinear transformations of $m$ exogenous features, $w_k$ are feature weights, $f(\cdot)$ is an unknown nonlinear mapping function, and $\epsilon_t$ represents stochastic noise.
Traditional approaches, such as ARIMA or exponential smoothing, assume linear dependencies and stationary series, which limits their ability to capture intricate temporal dynamics or interactions among multiple exogenous variables. Advanced computational models can learn complex nonlinear relationships and manage high-dimensional inputs, enabling improved representation of trends, seasonality, and sudden demand fluctuations.
Accurate demand forecasting is critical for inventory management, pricing, and supply chain optimization. Formally, the predictive model can be evaluated through a recursive multi-step framework:
$$\hat{y}_{t+h} = f_h\!\left(\hat{y}_{t+h-1}, \ldots, \hat{y}_{t};\ X_{t+h}\right), \qquad h = 1, 2, \ldots, H,$$
where $H$ is the forecast horizon, $\hat{y}_{t+h}$ is the predicted demand at horizon $h$, and $f_h(\cdot)$ denotes a potentially horizon-specific mapping function. This formulation highlights the iterative dependency in multi-step forecasting and motivates the use of advanced models capable of capturing both temporal dynamics and exogenous influences.
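The recursive multi-step scheme can be sketched as follows, where `model` is a hypothetical one-step predictor standing in for $f_h(\cdot)$ and each prediction is fed back into the lag window:

```python
def recursive_forecast(model, history, exog, horizon):
    """Roll a one-step model forward H steps, feeding each prediction
    back in as an input (the iterative dependency described above)."""
    preds = []
    window = list(history)              # last p observed (or predicted) demands
    for h in range(horizon):
        y_hat = model(window, exog[h])  # f_h(ŷ_{t+h-1}, ..., ŷ_t; X_{t+h})
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # slide the lag window forward
    return preds
```

Because each step consumes earlier predictions, errors can compound over the horizon, which is one motivation for models that condition strongly on exogenous features.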

3.4. Feature Modeling in Demand Forecasting

Effective feature modeling is essential for capturing the complex interactions between historical demand and contextual factors. Let $X_t = [x_{t,1}, \ldots, x_{t,m}]$ represent the set of $m$ features at time $t$, such as historical demand, promotions, and external factors. The feature representation can be formulated as follows:
$$h_t = \phi(X_t, W) = \sum_{k=1}^{m} w_k\, \sigma(W_k x_{t,k} + b_k) + \sum_{k=m+1}^{M} \alpha_k\, \gamma_k(X_t),$$
where $\sigma(\cdot)$ is an activation function, $\gamma_k(\cdot)$ represents higher-order feature interactions, and $w_k$ and $\alpha_k$ are learned coefficients. Temporal dependencies can be incorporated by aggregating past feature embeddings with attention-like weights:
$$H_t = \sum_{j=0}^{p} \beta_j\, \sigma(W_j X_{t-j} + b_j), \qquad \sum_{j=0}^{p} \beta_j = 1,\ \beta_j \in [0, 1],$$
where $\beta_j$ are learnable weights determining the influence of past time steps on the current forecast. Finally, the aggregated features are mapped to the predicted demand:
$$\hat{y}_t = f(H_t, X_t) + \epsilon_t,$$
where $f(\cdot)$ is a nonlinear mapping function, such as a deep neural network or Transformer, and $\epsilon_t$ is the error term. This highlights the importance of modeling both feature interactions and temporal dependencies, which form the foundation for integrating optimization techniques like PSO with deep learning models.
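A minimal sketch of the first-order feature embedding and the convex temporal aggregation above. The higher-order interaction terms $\gamma_k(\cdot)$ are omitted for brevity, and $\sigma = \tanh$ is an assumed activation:

```python
import numpy as np

def feature_embedding(X_t, w, W, b):
    """First-order part of h_t: sum_k w_k * sigma(W_k x_{t,k} + b_k)."""
    return float(np.sum(w * np.tanh(W * X_t + b)))

def temporal_aggregate(embeddings, beta):
    """H_t-style aggregation: convex weights beta_j over past embeddings."""
    beta = np.asarray(beta) / np.sum(beta)  # enforce sum(beta_j) = 1
    return float(np.dot(beta, embeddings))
```

Normalizing the weights keeps the aggregate on the same scale as the individual embeddings, mirroring the constraint $\sum_j \beta_j = 1$.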

3.5. Motivation for PSO–Transformer Integration

Demand forecasting in cross-border e-commerce is a complex task due to high dimensionality, nonlinearity, and interactions between internal and external factors. Traditional methods often fail to capture these complexities. The integration of PSO with Transformer models combines their strengths in optimization and feature extraction.
  • Challenges in Demand Forecasting: E-commerce demand forecasting involves multi-dimensional data such as historical sales, promotions, and external factors like holidays. While Transformers effectively model temporal dependencies, they often struggle with high-dimensional parameter spaces and local optima, reducing training efficiency and forecasting accuracy.
  • PSO’s Optimization Advantages: PSO, a population-based global optimization technique, excels in high-dimensional search spaces by avoiding local minima. It can optimize Transformer hyperparameters, such as learning rates and attention weights, to improve convergence speed and model performance, overcoming the limitations of gradient-based optimization.
  • Synergy of PSO and Transformer: The combination of PSO and Transformer leverages the strengths of both. PSO enhances Transformer training by optimizing its parameters globally. This synergy improves forecasting accuracy, reduces training time, and ensures a more robust and generalized model.
Integrating PSO with Transformer models improves convergence speed, model performance, and generalization, providing a powerful solution for accurate demand forecasting in e-commerce.

4. Methodology

4.1. Overview of the PAS Framework

To address the challenges of high-dimensionality, strong nonlinearity, and multi-scale demand fluctuations in cross-border e-commerce, this paper proposes a unified forecasting framework termed PAS, which integrates an improved PSO, an attention-based Transformer model, and a multi-stage search and learning strategy. The core idea of PAS is to decouple global exploration and local exploitation during model learning, enabling efficient parameter optimization and accurate demand pattern extraction.
Let $\mathcal{D} = \{(X_t, y_t)\}_{t=1}^{T}$ denote the historical demand dataset, where $X_t \in \mathbb{R}^{m}$ represents a multi-dimensional feature vector at time $t$ (e.g., historical sales, promotions, holidays, and regional indicators), and $y_t \in \mathbb{R}$ denotes the observed demand. The goal of demand prediction is to approximate a nonlinear function that relates inputs to future sales,
$$F: (X_{1:t},\ y_{1:t-1}) \mapsto \hat{y}_{t+1:t+H},$$
where $H$ is the forecasting horizon and $\hat{y}_{t+h}$ denotes the predicted demand at step $h$.
In the PAS framework, the mapping function $F$ is parameterized by a Transformer-based neural network with parameter set $\Theta$. Unlike conventional end-to-end training approaches that rely solely on gradient-based optimization, PAS introduces an improved PSO to globally optimize a subset of critical parameters $\Theta_p \subset \Theta$, while the remaining parameters $\Theta_g$ are refined through gradient descent:
$$\Theta = \Theta_p \cup \Theta_g, \qquad \Theta_p \cap \Theta_g = \emptyset.$$
The overall optimization objective of PAS can be formulated as a bilevel problem:
$$\min_{\Theta_p} \mathcal{L}\big(F(X; \Theta_p, \Theta_g^{*})\big), \quad \text{s.t.} \quad \Theta_g^{*} = \arg\min_{\Theta_g} \mathcal{L}\big(F(X; \Theta_p, \Theta_g)\big),$$
where $\mathcal{L}(\cdot)$ denotes the forecasting loss function. This formulation highlights the complementary roles of PSO-based global search and gradient-based local refinement. The PAS framework diagram is shown in Figure 1.
PAS employs a multi-stage search strategy to achieve a balance between exploration and exploitation. Specifically, the learning process is divided into two sequential stages:
(1). Stage I: Global Exploration via Improved PSO.
In the first stage, the improved PSO algorithm explores the parameter space $\Theta_p$ to identify promising regions that yield low forecasting error. Each particle $i$ represents a candidate solution $\theta_i \in \mathbb{R}^{D}$, where $D = |\Theta_p|$. The fitness function of each particle is defined as follows:
$$f(\theta_i) = \frac{1}{|\mathcal{D}_{\text{val}}|} \sum_{(X_t, y_t) \in \mathcal{D}_{\text{val}}} \ell\big(y_t, \hat{y}_t(\theta_i)\big),$$
where $\mathcal{D}_{\text{val}}$ is the validation set and $\ell(\cdot)$ is the pointwise loss. An optimal-point set initialization strategy is employed to generate an initial swarm with enhanced diversity and coverage, thereby improving global search efficiency and convergence stability.
(2). Stage II: Local Exploitation via Attention-Based Learning. After the global exploration stage converges or reaches a predefined threshold, the best particle $\theta^{*}$ is used to initialize the Transformer model. Subsequently, attention mechanisms are employed to capture fine-grained temporal dependencies and feature interactions. Given the encoded input sequence $H = [h_1, \ldots, h_T]$, the attention-based refinement can be expressed as follows:
$$Z_t = \sum_{j=1}^{T} \alpha_{t,j}\, h_j, \qquad \alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T} \exp(e_{t,k})},$$
where $e_{t,j}$ measures the relevance between time steps $t$ and $j$. This mechanism enables the model to focus on key demand drivers and critical historical periods, enhancing local pattern learning. The transition between stages is governed by a convergence criterion:
$$\big| f_{\text{best}}^{(k)} - f_{\text{best}}^{(k-1)} \big| < \epsilon,$$
where $f_{\text{best}}^{(k)}$ denotes the best fitness value at PSO iteration $k$, and $\epsilon$ is a small threshold. This criterion ensures that global exploration is sufficiently exploited before shifting to attention-based refinement.
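The Stage-I fitness evaluation and the stage-transition criterion can be sketched as follows. Here `predict` is a hypothetical forecasting function, and the squared error stands in for the unspecified pointwise loss $\ell$:

```python
def validation_fitness(predict, theta, val_set):
    """f(theta_i): mean pointwise loss of particle theta on the validation set."""
    return sum((y - predict(x, theta)) ** 2 for x, y in val_set) / len(val_set)

def should_switch_stage(f_best_history, eps=1e-4):
    """Stage-transition criterion: |f_best(k) - f_best(k-1)| < eps."""
    if len(f_best_history) < 2:
        return False
    return abs(f_best_history[-1] - f_best_history[-2]) < eps
```

In practice the swarm would append its best fitness after every iteration and hand control to the attention-based refinement stage once `should_switch_stage` first returns true.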
Overall, PAS establishes a hierarchical learning paradigm that integrates global parameter optimization with localized attention-driven representation learning. By explicitly separating global and local search processes and coupling PSO with Transformer attention mechanisms, the proposed framework effectively captures both macrolevel demand trends and microlevel fluctuations, providing a robust and scalable solution for demand forecasting in cross-border e-commerce.

4.2. Improved Particle Swarm Optimization with Optimal-Point Initialization

Conventional PSO often suffers from low search accuracy, slow convergence speed, and premature convergence to local optima. To overcome these limitations, this paper introduces an improved PSO algorithm by incorporating an optimal-point set initialization strategy and a multilevel search and learning mechanism, enabling the algorithm to accurately identify high-quality solutions during the evolutionary process. The improved PSO algorithm is shown in Figure 2.
(1) Optimal-Point Set Initialization Strategy. In standard PSO, random initialization of particles frequently leads to uneven population distribution, reducing swarm diversity and slowing convergence. To address this issue, an optimal-point set strategy is employed to generate the initial population, ensuring a more uniform coverage of the search space and enhancing global exploration capability. Let $\{x_1^{\text{opt}}, x_2^{\text{opt}}, \ldots, x_n^{\text{opt}}\}$ denote an optimal-point set; the initial position of particle $i$ is then defined as
$$x_i^{0} = \sum_{k=1}^{n} w_k\, x_k^{\text{opt}}, \qquad \sum_{k=1}^{n} w_k = 1,\ w_k \geq 0,$$
which enhances the diversity of candidate solutions and accelerates the search, especially in high-dimensional spaces.
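The convex-combination initialization above can be sketched as follows. The optimal-point set itself is assumed to be supplied (e.g., from a low-discrepancy construction, which the paper does not detail), and Dirichlet sampling is one illustrative way to draw nonnegative weights that sum to one:

```python
import numpy as np

def init_swarm_from_point_set(point_set, n_particles, seed=0):
    """Initialize particle i as sum_k w_k x_k^opt with w_k >= 0, sum w_k = 1."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(point_set)                             # shape (n, dim)
    # Each Dirichlet sample is a valid convex-combination weight vector.
    W = rng.dirichlet(np.ones(len(pts)), size=n_particles)  # rows sum to 1
    return W @ pts                                          # shape (n_particles, dim)
```

Because every particle starts inside the convex hull of the point set, the initial swarm inherits the set's coverage of the search region instead of clustering at random.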
(2) Adaptive Multilevel Search Mechanism. An adaptive coefficient $K$ is applied to dynamically regulate the balance between broad exploration and focused exploitation during optimization:
$$K = 2r\, \exp\!\left(-\frac{2i}{M}\right) - 2\exp\!\left(-\frac{2i}{M}\right), \qquad r \sim U(0, 1),$$
where $i$ represents the current iteration and $M$ is the total number of iterations. The coefficient $K$ enables global exploration in early iterations, prevents premature local convergence in the middle stage, and promotes accurate convergence in the later stage.
Based on $K$, a three-level search mechanism is constructed. In the second-level search stage, a random transition learning strategy inspired by the Fitness-Distance Balance (FDB) mechanism is adopted to enhance local exploration. The velocity update rule is expressed as
$$v_i^{t+1} = \omega v_i^{t} + c_3 r_1\, \mathrm{sign}(x_{\text{ref}} - x_i^{t}) + c_2 r_2 (g - x_i^{t}),$$
where $x_{\text{ref}}$ is a reference particle selected based on Euclidean distance or FDB ranking, $c_3$ is the local learning coefficient, and $r_1, r_2 \sim U(0, 1)$.
(3) Lévy Flight-Based Global Correction. As particles tend to cluster around local optima in later iterations, population diversity may rapidly decrease. To avoid stagnation, a Lévy flight strategy is introduced in the third-level search to perturb both individual particles and the global best solution:
$$v_i^{t+1} = \omega v_i^{t} + 0.05\, L(\lambda) \odot (p_i - x_i^{t}) + c_2 r_2 (g - x_i^{t}),$$
where $L(\lambda)$ is a random vector following a Lévy distribution, and $\odot$ indicates element-wise multiplication. This approach helps preserve the diversity of particles and strengthens the algorithm's ability to search globally.
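A sketch of the Lévy-flight perturbation. Mantegna's algorithm is one standard way to draw $L(\lambda)$ steps (the paper does not specify its sampler), and $\lambda = 1.5$ is an assumed exponent:

```python
import math
import numpy as np

def levy(dim, lam=1.5, rng=None):
    """Draw a Lévy-distributed step L(lambda) via Mantegna's algorithm."""
    rng = rng or np.random.default_rng(0)
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / lam)   # heavy-tailed step sizes

def levy_velocity_update(v, x, p_i, g, w=0.7, c2=1.5, rng=None):
    """Third-level update: v <- w v + 0.05 L(lambda) ⊙ (p_i - x) + c2 r2 (g - x)."""
    rng = rng or np.random.default_rng(0)
    L = levy(len(x), rng=rng)
    r2 = rng.random(len(x))
    return w * v + 0.05 * L * (p_i - x) + c2 * r2 * (g - x)
```

The heavy tail of the Lévy distribution occasionally produces large jumps, which is what lets clustered particles escape a local optimum late in the run.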
By integrating optimal-point initialization with an adaptive multilevel search strategy, the proposed PSO significantly improves search accuracy, convergence speed, and robustness against local optima, providing high-quality solutions for subsequent attention-based learning in the PAS framework.

4.3. Multi-Stage Search Strategy of PAS

To balance global exploration and local exploitation in complex optimization problems, PAS introduces a multi-stage search strategy with adaptive learning control. Unlike conventional PSO with fixed update rules, the search process is dynamically divided into multiple stages according to an adaptive coefficient, allowing particles to progressively refine solutions while maintaining population diversity.
Let $i$ represent the current iteration and $M$ the total number of iterations. An adaptive control coefficient $K(i)$ is defined as
$$K(i) = \alpha\, r\, \exp\!\left(-\beta \frac{i}{M}\right) + (1 - \alpha)\left(1 - \frac{i}{M}\right)^{\gamma},$$
where $r \in (0, 1)$ is a random number, and $\alpha$, $\beta$, and $\gamma$ are positive control parameters. Based on $K(i)$, the search stage at iteration $i$ is determined by
$$S(i) = \begin{cases} S_1, & K(i) > \mu + \sigma, \\ S_2, & \mu - \sigma < K(i) \leq \mu + \sigma, \\ S_3, & K(i) \leq \mu - \sigma, \end{cases}$$
where $\mu$ and $\sigma$ denote the mean and standard deviation of $K(i)$ over the iteration process. In $S_1$, particles emphasize global exploration by strengthening social learning. In $S_2$, a transition learning mechanism balances exploration and exploitation. In $S_3$, Lévy flight perturbations are introduced to enhance local refinement and avoid premature convergence. This adaptive stage-wise strategy enables PAS to efficiently converge toward high-quality solutions. The overall procedure is summarized in Algorithm 1.
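The adaptive coefficient and stage selection can be sketched as follows; the values of $\alpha$, $\beta$, and $\gamma$ are illustrative assumptions, not the paper's tuned settings:

```python
import math
import random

def adaptive_K(i, M, alpha=0.5, beta=2.0, gamma=2.0, rng=random.Random(0)):
    """K(i) = alpha * r * exp(-beta i/M) + (1 - alpha)(1 - i/M)^gamma, r ~ U(0,1)."""
    r = rng.random()
    return alpha * r * math.exp(-beta * i / M) + (1 - alpha) * (1 - i / M) ** gamma

def select_stage(K, mu, sigma):
    """Map K(i) to a search stage via the mean/std thresholds above."""
    if K > mu + sigma:
        return "S1"      # global exploration (social learning)
    if K > mu - sigma:
        return "S2"      # transition learning
    return "S3"          # Lévy-flight local refinement
```

Because both terms of $K(i)$ decay as $i$ approaches $M$, later iterations fall below the $\mu - \sigma$ threshold more often, shifting the swarm toward the refinement stage.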
Algorithm 1: Multi-Stage Search Strategy of PAS
[Algorithm 1 pseudocode is provided as a figure in the original publication.]

4.4. Transformer-Based Demand Forecasting Model

To model complex temporal dependencies and nonlinear correlations in demand sequences, a Transformer-based forecasting model is adopted. By leveraging self-attention, the model captures long-range temporal interactions while enabling efficient parallel computation, making it suitable for high-dimensional demand prediction tasks.
Let $X = \{x_1, x_2, \ldots, x_T\}$ denote the historical demand sequence, where $x_t \in \mathbb{R}^{d}$. After embedding and positional encoding, the encoded representation is given by
$$H = \mathrm{MHA}(X W_e + P) + \mathrm{FFN}\big(\mathrm{MHA}(X W_e + P)\big),$$
where $W_e$ is the embedding matrix, $P$ denotes positional encoding, and $\mathrm{MHA}(\cdot)$ and $\mathrm{FFN}(\cdot)$ represent multi-head self-attention and feed-forward networks. The predicted demand $\hat{y}_{t+1}$ is obtained by a linear projection of the final hidden state and optimized via a regularized loss function:
$$\mathcal{L} = \frac{1}{N} \sum_{t=1}^{N} \left\| y_t - \hat{y}_t \right\|_2^2 + \lambda \|\Theta\|_2^2,$$
where $\Theta$ denotes the model parameters and $\lambda$ is the regularization coefficient. The overall training and inference process of the Transformer-based demand forecasting model is summarized in Algorithm 2.
Algorithm 2: Transformer-Based Demand Forecasting
Input: Historical demand sequence $X$, model parameters $\Theta$
Output: Predicted demand $\hat{y}$
1: Initialize embedding and attention parameters;
2: Compute encoded features $H$ using Equation (22);
3: Project $H$ to the output space to obtain $\hat{y}$;
4: Update $\Theta$ by minimizing the loss $\mathcal{L}$ in Equation (23);
5: return $\hat{y}$;

4.5. PSO-Guided Attention Learning Mechanism

To enhance the adaptability of the attention mechanism under complex and dynamic demand patterns, a PSO-guided attention learning mechanism is proposed. Instead of relying solely on gradient-based optimization, key attention parameters are optimized by PSO to improve global search capability and avoid convergence to suboptimal attention distributions.
Let $H \in \mathbb{R}^{T \times d}$ denote the encoded feature sequence generated by the Transformer encoder. The attention weight vector $\alpha$ is parameterized by a set of learnable coefficients $\theta$, which are optimized by PSO. The PSO-guided attention output is defined as
$$A = \mathrm{softmax}\!\left(\frac{H W_Q(\theta)\, \big(H W_K(\theta)\big)^{\top}}{\sqrt{d}}\right) H W_V(\theta),$$
where $W_Q(\theta)$, $W_K(\theta)$, and $W_V(\theta)$ are attention projection matrices jointly controlled by PSO particles. The velocity and position update rules for attention parameters are formulated as
$$v_i^{t+1} = \omega v_i^{t} + c_1 r_1 (p_i - \theta_i^{t}) + c_2 r_2 (g - \theta_i^{t}), \qquad \theta_i^{t+1} = \theta_i^{t} + v_i^{t+1},$$
where $v_i$ is the velocity and $\theta_i$ the attention parameter vector for particle $i$, $p_i$ and $g$ are the particle's best-known position and the swarm's best, $\omega$ is the inertia factor, $c_1$ and $c_2$ are acceleration constants, and $r_1$, $r_2$ are random values sampled from $U(0, 1)$.
Through these PSO updates, the attention mechanism is guided toward parameter configurations that improve prediction accuracy and robustness without relying solely on gradient descent.
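The velocity and position updates above can be sketched in plain Python: each particle carries an attention-parameter vector $\theta_i$, and a toy sphere objective stands in for the (hypothetical) forecasting-loss fitness used in PAS.

```python
import random

def pso_step(positions, velocities, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """One velocity/position update of the rules above, applied to every
    particle's attention-parameter vector theta_i (plain Python lists)."""
    for i, (theta, v) in enumerate(zip(positions, velocities)):
        for j in range(len(theta)):
            r1, r2 = rng.random(), rng.random()
            v[j] = (w * v[j]
                    + c1 * r1 * (pbest[i][j] - theta[j])   # pull toward personal best
                    + c2 * r2 * (gbest[j] - theta[j]))     # pull toward swarm best
            theta[j] += v[j]

def fitness(theta):
    """Toy sphere objective standing in for the forecasting-loss fitness."""
    return sum(x * x for x in theta)

rng = random.Random(1)
pos = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]  # 5 particles, 3 params
vel = [[0.0] * 3 for _ in range(5)]
pbest = [list(p) for p in pos]
gbest = list(min(pbest, key=fitness))
f0 = fitness(gbest)                        # best fitness before optimization
for _ in range(100):
    pso_step(pos, vel, pbest, gbest, rng)
    for i, p in enumerate(pos):            # refresh personal bests
        if fitness(p) < fitness(pbest[i]):
            pbest[i] = list(p)
    gbest = list(min(pbest, key=fitness))  # refresh swarm best
```

Because personal and global bests are only ever replaced by strictly better candidates, the best-found fitness is non-increasing over iterations.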

4.6. Model Training and Optimization Procedure

The training of the PAS framework integrates global optimization via PSO with local gradient-based refinement for both the Transformer and attention parameters. Let $\Theta = \{\Theta_p, \Theta_g\}$ denote the full parameter set, where $\Theta_p$ represents parameters optimized by PSO (e.g., attention matrices and key Transformer weights) and $\Theta_g$ represents parameters updated via gradient descent. The overall optimization objective is formulated as a bilevel problem:
$$\min_{\Theta_p} \ \mathcal{L}_{\mathrm{val}}\big(F(X; \Theta_p, \Theta_g^{*})\big), \quad \text{s.t.} \quad \Theta_g^{*} = \arg\min_{\Theta_g} \mathcal{L}_{\mathrm{train}}\big(F(X; \Theta_p, \Theta_g)\big),$$
where $\mathcal{L}_{\mathrm{train}}$ and $\mathcal{L}_{\mathrm{val}}$ are the training and validation losses. The iterative PSO-guided update of $\Theta_p$ is given by
$$\Theta_p^{t+1} = \Theta_p^t + v^{t+1}, \qquad v^{t+1} = \omega v^t + c_1 r_1 (p - \Theta_p^t) + c_2 r_2 (g - \Theta_p^t),$$
where $v$ is the velocity vector, $p$ and $g$ are the personal-best and global-best particles, and $\omega$, $c_1$, $c_2$ are standard PSO hyperparameters. The full training procedure is summarized in Algorithm 3.
Algorithm 3: PAS Model Training Procedure
1: Input: Dataset $D = \{(X_t, y_t)\}$, maximum iterations $M$
2: Initialize: PSO particles for $\Theta_p$, Transformer parameters $\Theta_g$
3: for $t = 1$ to $M$ do
4:     Evaluate particle fitness $f(\Theta_p)$ on $D_{\mathrm{val}}$
5:     Update personal best $p_i$ and global best $g$
6:     Update PSO velocities $v_i$ and positions $\Theta_p$
7:     Gradient-descent update of $\Theta_g$ using $\mathcal{L}_{\mathrm{train}}$
8:     Optionally apply attention-guided refinement to $\Theta_p$
9: end for
10: Output: Optimized parameters $\Theta^* = \{\Theta_p^*, \Theta_g^*\}$
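The alternation in Algorithm 3 can be illustrated end-to-end on a deliberately tiny problem. In the sketch below (our construction, not the paper's implementation), a toy model $y = ax + b$ has one PSO-optimized parameter $a$ standing in for $\Theta_p$ and one gradient-fitted parameter $b$ standing in for $\Theta_g$; the data are generated from $y = 2x + 1$.

```python
import random

# Training data from y = 2x + 1 and a small held-out validation set.
train = [(float(x), 2.0 * x + 1.0) for x in range(-5, 6)]
val = [(0.5, 2.0), (1.5, 4.0)]

def mse(data, a, b):
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

def inner_gd(a, steps=50, lr=0.05):
    """Inner problem: fit Theta_g = b by gradient descent on the training loss."""
    b = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (a * x + b - y) for x, y in train) / len(train)
        b -= lr * grad
    return b

def outer_fitness(a):
    """Outer problem: validation loss after the inner fit (bilevel objective)."""
    return mse(val, a, inner_gd(a))

rng = random.Random(0)
particles = [rng.uniform(-4.0, 4.0) for _ in range(6)]  # candidate values of a
vels = [0.0] * 6
pbest = list(particles)
gbest = min(pbest, key=outer_fitness)
f0 = outer_fitness(gbest)                 # best bilevel loss before optimization
for _ in range(30):
    for i in range(6):
        r1, r2 = rng.random(), rng.random()
        vels[i] = (0.7 * vels[i] + 1.5 * r1 * (pbest[i] - particles[i])
                   + 1.5 * r2 * (gbest - particles[i]))
        particles[i] += vels[i]
        if outer_fitness(particles[i]) < outer_fitness(pbest[i]):
            pbest[i] = particles[i]
    gbest = min(pbest, key=outer_fitness)
```

After training, `gbest` approaches the generating slope $a = 2$ and the inner fit recovers $b \approx 1$, mirroring how PSO steers $\Theta_p$ while gradient descent refits $\Theta_g$ at each evaluation.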

5. Experimental Environment

5.1. Dataset

To evaluate the effectiveness and generalization of the proposed PAS model, we employ two publicly available real-world cross-border e-commerce datasets: the Kaggle Cross-Border E-Commerce Dataset and the Antai Cup International E-Commerce Challenge Dataset. These datasets reflect realistic demand patterns influenced by promotions, regions, seasons, and market fluctuations.

5.1.1. Dataset Description

Kaggle Dataset: This dataset contains multi-platform user browsing sessions and sales records across international markets, including user interaction histories, product identifiers, timestamps, and geographic attributes. The daily granularity and 58,791 total samples provide a representative benchmark for cross-border demand forecasting. Missing values are minimal and handled via linear interpolation for numerical features and mode filling for categorical features. The demand series exhibits a gentle upward trend, weak weekly seasonality, and moderate volatility. Input sequence lengths used in our experiments range from 30 to 180 days. The dataset can be accessed at Kaggle: https://www.kaggle.com/datasets/programmer3/cross-border-e-commerce-dataset (accessed on 8 October 2025).
Antai Cup Dataset: Hosted on Alibaba Cloud Tianchi, this dataset includes detailed user purchase histories, product attributes, order timestamps, and regional information from a large-scale e-commerce platform, with 72,453 daily samples. Missing rates are low (all below 1.2%), processed similarly to the Kaggle dataset. The dataset shows comparable trend and periodicity, with slightly higher regional sub-series fluctuations, which test model robustness and generalization. Input sequences also range from 30 to 180 days. The dataset is available via Tianchi: https://tianchi.aliyun.com/dataset/29170 (accessed on 10 October 2025).
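The missing-value treatment described above (linear interpolation for numerical features, mode filling for categorical ones) can be sketched in plain Python; the helper names are ours, and the sketch assumes each series contains at least one observed value.

```python
from collections import Counter

def linear_interpolate(series):
    """Fill None gaps in a numeric list by linear interpolation between the
    nearest observed neighbours; edge gaps copy the nearest observed value."""
    filled = list(series)
    valid = [i for i, v in enumerate(filled) if v is not None]
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:                       # gap at the start of the series
            filled[i] = filled[right]
        elif right is None:                    # gap at the end of the series
            filled[i] = filled[left]
        else:                                  # interior gap: linear blend
            w = (i - left) / (right - left)
            filled[i] = filled[left] * (1 - w) + filled[right] * w
    return filled

def mode_fill(values):
    """Replace None in a categorical list with the most frequent category."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(linear_interpolate([10.0, None, None, 16.0]))  # [10.0, 12.0, 14.0, 16.0]
```

Given the low missing rates reported for both datasets (below 1.2%), this simple scheme preserves trend and seasonality without distorting the series.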

5.1.2. Dataset Statistics and Comparability

Table 1 summarizes the descriptive statistics and data quality indicators for both datasets, including mean, standard deviation, median, maximum and minimum demand, coefficient of variation (CV), missing value ratios, sequence lengths, and demand volatility. These metrics demonstrate that the two datasets are comparable in terms of scale, variability, and task complexity, justifying their use for cross-border e-commerce demand forecasting experiments.

5.2. Experimental Setup

Experiments were performed on a high-performance computing workstation to facilitate efficient training and assessment of the PAS framework. The hardware environment comprises a multi-core CPU, ample memory, and a high-end GPU for accelerated computation. The software environment includes Python and widely used scientific computing libraries, providing a robust platform for implementing both the Transformer-based forecasting model and the PSO optimization algorithm. A fixed random seed was applied to guarantee reproducibility across all experiments. The detailed hardware and software configurations are summarized in Table 2.
This configuration ensures sufficient computational resources for handling high-dimensional input features, large-scale datasets, and complex model architectures, providing a reliable and efficient platform for the proposed cross-border e-commerce demand forecasting experiments.

5.3. Parameter Settings

In this study, we carefully configure the experimental settings for the PAS framework, encompassing both the Transformer-based demand forecasting model and the enhanced PSO algorithm. The Transformer encoder–decoder uses multi-head self-attention to model relationships across time in historical demand and auxiliary features. Key hyperparameters, including the number of layers, attention heads, embedding dimensions, and feed-forward network dimensions, are selected based on preliminary experiments and prior literature to ensure stable convergence and high predictive accuracy. Dropout is applied to prevent overfitting, and positional encoding is used to preserve temporal information across sequences.
The improved PSO algorithm is employed to optimize critical Transformer parameters through an optimal-point set initialization strategy and a multi-stage search process. Population size, inertia weight, acceleration coefficients, and maximum iterations are carefully set to balance convergence speed and prediction performance. Additionally, Lévy flight perturbation is incorporated in the later stages of optimization to prevent premature convergence and enhance the global search capability. The complete hyperparameter configuration for both the Transformer model and PSO optimization is summarized in Table 3.
These hyperparameters provide a balance between global search capability via PSO and local pattern extraction via Transformer attention. Sensitivity analyses indicate that model performance is robust to small variations in these settings. Unlike standard backpropagation, using PSO to optimize Transformer parameters introduces additional computational overhead, especially for attention weights in a bilevel setup. Training PAS generally requires more time and memory than baseline Transformers. The computational complexity scales as $O(N \cdot M \cdot P)$, where $N$ is the particle population size, $M$ the number of iterations, and $P$ the number of optimized parameters. Specific details on training time, memory usage, and practical implications for e-commerce forecasting are provided in Section 6.5.
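The Lévy flight perturbation mentioned above can be sampled, for example, with Mantegna's algorithm; the paper does not specify its exact sampler, so the $\beta = 1.5$ and step-scale values below are illustrative assumptions.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """One Levy-flight step via Mantegna's algorithm: u / |v|^(1/beta),
    with u ~ N(0, sigma_u^2) and v ~ N(0, 1)."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def perturb(position, rng, scale=0.01):
    """Late-stage perturbation of a particle position (scale is an assumption)."""
    return [x + scale * levy_step(rng) for x in position]

rng = random.Random(42)
steps = [levy_step(rng) for _ in range(1000)]
p = perturb([0.0, 0.0, 0.0], rng)
```

The heavy-tailed step distribution produces mostly small moves with occasional long jumps, which is what lets late-stage particles escape local optima without destroying the converged swarm structure.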

5.4. Evaluation Metrics

To systematically compare the PAS framework with baseline models in predicting global e-commerce demand, we employ a set of widely accepted regression metrics. Each metric highlights a specific aspect of forecasting accuracy, robustness, or trend fidelity. The metrics are defined as follows:
(1) Mean Absolute Error (MAE):
MAE quantifies the average size of prediction errors without considering their direction. It offers an intuitive measure of accuracy and is less affected by outliers compared to RMSE.
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|,$$
where $y_i$ and $\hat{y}_i$ are the actual and predicted demand, and $N$ is the total number of samples. Lower MAE indicates better prediction performance.
(2) Root Mean Squared Error (RMSE):
RMSE measures the typical deviation between predicted and actual values by taking the square root of the mean squared differences.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}.$$
(3) Mean Absolute Percentage Error (MAPE):
MAPE expresses the prediction error as a percentage of the actual value, allowing for scale-independent comparison across products or regions. A caveat of MAPE is its sensitivity when $y_i$ approaches zero.
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i}.$$
(4) Symmetric Mean Absolute Percentage Error (SMAPE):
SMAPE addresses the asymmetry in MAPE by normalizing the absolute error with the mean of predicted and actual values. This provides a more balanced evaluation of overestimation and underestimation.
$$\mathrm{SMAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{|\hat{y}_i - y_i|}{(|y_i| + |\hat{y}_i|)/2}.$$
(5) Coefficient of Determination ($R^2$):
$R^2$ indicates the fraction of the observed variance captured by the model. Values near 1 reflect strong predictive capability, whereas negative values suggest performance worse than a mean-based prediction.
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2},$$
where $\bar{y}$ is the mean of the actual values. By integrating these complementary metrics, we achieve a comprehensive evaluation that considers absolute errors, sensitivity to large deviations, relative errors, and the model’s ability to capture overall demand trends.
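For concreteness, the five metrics can be implemented directly in plain Python; the sample values below reuse the illustrative actual/predicted pairs quoted for Figure 4 and are not benchmark results.

```python
import math

def mae(y, yhat):
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

def mape(y, yhat):
    # Expressed in percent; undefined when an actual value is zero.
    return 100.0 / len(y) * sum(abs(a - p) / a for a, p in zip(y, yhat))

def smape(y, yhat):
    # Symmetric variant: error normalized by the mean of |actual| and |predicted|.
    return 100.0 / len(y) * sum(abs(p - a) / ((abs(a) + abs(p)) / 2) for a, p in zip(y, yhat))

def r2(y, yhat):
    ybar = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Illustrative actual/predicted pairs taken from the Figure 4 discussion.
y = [150.0, 170.0, 230.0, 225.0]
yhat = [148.0, 172.0, 228.0, 227.0]
print(round(mae(y, yhat), 2))  # 2.0
```

With a constant absolute error of 2 units, MAE and RMSE coincide, while the percentage metrics stay near 1%, matching the near-overlapping curves described for Figure 4.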

5.5. Baseline Models for Comparison

5.5.1. Comparison Method

To rigorously assess the performance of the proposed PAS framework, we compare it against a diverse set of representative and recently developed forecasting methods, covering four categories of mainstream forecasting methods: traditional statistical models, classical machine learning, deep learning, and advanced hybrid models. Specifically, ARIMA is included as a classical statistical model that captures linear trends and seasonality in univariate time series, emphasizing simplicity and interpretability [29]. XGBoost is included as a classical machine learning model that leverages decision trees and ensemble learning to capture complex nonlinear patterns when combined with time-lagged features. LSTM [30] serves as a deep learning baseline, modeling long-term temporal dependencies in sequential data.
In addition, several advanced Transformer-based and hybrid models are incorporated to reflect recent progress in demand and time series forecasting. The Temporal Fusion Transformer (TFT) [31] has been shown to be effective for interpretable multi-horizon demand forecasting by leveraging attention mechanisms and gating layers. Metaheuristic-optimized Transformer models, which integrate optimization techniques such as Grey Wolf or Whale Optimization Algorithms for adaptive hyperparameter tuning, are included due to their demonstrated improvements in forecasting accuracy and training efficiency [32]. Additionally, GA and SA optimization have been applied to Transformer models, combining the global search capability of genetic algorithms with the local refinement of simulated annealing to effectively balance exploration and exploitation in hyperparameter tuning, thereby enhancing model performance and convergence speed.
Furthermore, recent Transformer variants specifically designed for time series modeling, including iTransformer with inverted attention mechanisms [33] and hybrid architectures such as HTMformer [34] that jointly capture temporal and multivariate dependencies, are considered as competitive deep learning baselines [35]. Finally, a standardized hyperparameter tuning protocol is adopted to ensure fair comparisons across models, providing a consistent benchmark for evaluating forecasting accuracy and robustness.

5.5.2. Baseline Model Architectures and Hyperparameter Settings

To ensure a fair and rigorous comparison between PAS and baseline models, we adopt carefully designed architectures and perform systematic hyperparameter searches for all baselines. The main configurations are summarized in Table 4. Specifically:
  • ARIMA: The $(p, d, q)$ orders are selected for each time series based on the AIC and BIC criteria. Seasonal ARIMA is applied when seasonality is detected.
  • XGBoost: Time-lagged features and rolling statistics are used as inputs. The number of trees is searched over $[100, 300, 500]$, the learning rate over $[0.01, 0.05, 0.1, 0.2]$, and the maximum depth over $[3, 5, 7]$.
  • LSTM: Both standard and bidirectional LSTM models are included to capture temporal dependencies. We explore 1–2 hidden layers with 32–128 units per layer, dropout rates of 0.2–0.3, and sequence lengths consistent with the PAS inputs. Bi-LSTM is included because bidirectional encoding can improve performance in offline time-series tasks.
  • Transformer/TFT/MOT: We adopt 2–3 encoder layers, a hidden size of 128, 4 attention heads, and dropout rates in $[0.1, 0.2, 0.3, 0.4]$. The learning rate is tuned over $[0.001, 0.005]$ with the Adam optimizer, and batch sizes are kept consistent with PAS training.
  • GA/SA: The same Transformer architecture as above, with hyperparameters tuned via GA or SA metaheuristic optimization to enhance forecasting accuracy.
All baseline models are trained under identical experimental conditions and early stopping strategies. Grid or random search is performed within the specified hyperparameter ranges to ensure that each model achieves near-optimal performance. The detailed configurations and search ranges for all baseline models are presented in Table 4, which allows reproducibility and demonstrates the fairness of the comparison.
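The grid-search protocol can be sketched as an exhaustive enumeration over the stated XGBoost ranges; `evaluate` below is a toy stand-in for training a model and returning its validation error, not the actual training routine.

```python
from itertools import product

# Search ranges as listed above for the XGBoost baseline.
grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}

def grid_search(grid, evaluate):
    """Exhaustively evaluate every configuration and keep the lowest score."""
    keys = list(grid)
    best_cfg, best_score = None, float("inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy stand-in objective (not a trained model): it simply favours a
# mid-sized configuration so the search has a unique optimum.
def toy_validation_mae(cfg):
    return (abs(cfg["learning_rate"] - 0.05)
            + abs(cfg["max_depth"] - 5)
            + cfg["n_estimators"] / 1e4)

best_cfg, best_score = grid_search(grid, toy_validation_mae)
print(best_cfg)
```

The grid above yields 36 configurations; random search over the same ranges simply samples a subset of this product instead of enumerating it fully.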

6. Results

6.1. Overall Forecasting Performance

To evaluate the proposed PAS framework for cross-border e-commerce demand forecasting, we perform extensive experiments on two real-world datasets: the Kaggle Cross-Border E-Commerce Dataset and the Antai Cup International E-Commerce Challenge Dataset. The performance of PAS is compared with six representative baseline models, covering four categories of mainstream forecasting methods: traditional statistical models, classical machine learning, deep learning, and advanced hybrid models. As in Section 5.5, ARIMA serves as the classical statistical baseline capturing linear trends and seasonality in univariate series, XGBoost as the classical machine learning baseline exploiting ensemble learning and decision trees on time-lagged features, and LSTM as the deep learning baseline modeling long-term temporal dependencies in sequential data.
Transformer is selected as the backbone of PAS due to its structural advantages. Its self-attention mechanism allows adaptive weighting of different time steps, effectively capturing long-range dependencies, while positional encodings preserve sequential order. Parallelizable computation and gating layers improve efficiency and allow the model to capture complex temporal and multivariate patterns, making it highly suitable for multi-horizon demand forecasting.
Advanced hybrid models, including TFT and MOT, combine attention mechanisms with gating layers and optimization-based hyperparameter tuning to further improve forecasting accuracy. Additionally, we include two metaheuristic optimization algorithms, GA and SA, applied to the Transformer, to evaluate the impact of different optimization strategies on performance. All models were trained and evaluated under identical experimental environments and parameter tuning standards to ensure fairness. We assess performance using five core metrics: MAE, RMSE, MAPE, SMAPE, and $R^2$.

6.1.1. Quantitative Performance Comparison

In this subsection, we provide a comprehensive evaluation of the PAS framework by comparing it against several baseline models on two real-world cross-border e-commerce datasets: Kaggle Cross-Border E-Commerce Dataset (Dataset 1) and Antai Cup International E-Commerce Challenge Dataset (Dataset 2). The performance results, as shown in Table 5 and Table 6, highlight PAS’s superior forecasting accuracy and its robustness across multiple performance metrics.
(1) Performance on Dataset 1 (Kaggle Cross-Border E-Commerce Dataset)
On Dataset 1 (Kaggle), PAS demonstrates significant superiority over all baseline models in forecasting accuracy. Specifically, PAS achieves the lowest MAE of 25.18, which is 12.3% lower than the second-best model, TFT (28.72). Additionally, PAS shows a substantial improvement over traditional models such as ARIMA, reducing MAE by 57.3% (25.18 vs. 58.91). This indicates that PAS not only outperforms deep learning-based models, but also surpasses traditional time series forecasting models that often struggle with capturing complex nonlinear demand patterns typical in cross-border e-commerce scenarios.
To further explore the effect of metaheuristic optimization, we additionally evaluate GA- and SA-optimized Transformer models. GA-Transformer achieves an MAE of 29.05 and RMSE of 35.21, while SA-Transformer achieves an MAE of 29.47 and RMSE of 35.64. These results show that although GA and SA can improve standard Transformer performance moderately, PAS still outperforms these optimized models, confirming its superior forecasting capability.
For RMSE, PAS also excels with a value of 31.47, representing a 14.1% reduction compared to TFT (36.65) and a 10–11% reduction compared to GA-/SA-optimized Transformers. These results confirm that PAS effectively captures underlying demand trends while minimizing large prediction errors.
Furthermore, PAS achieves the best performance in SMAPE and $R^2$. PAS’s SMAPE of 7.6% is the lowest among all models, while the GA- and SA-optimized Transformers achieve SMAPE values of 8.7% and 8.8%, respectively. The $R^2$ value for PAS reaches 0.923, higher than that of GA-Transformer (0.884) and SA-Transformer (0.881), demonstrating that PAS not only improves prediction accuracy but also explains a higher proportion of the variance in actual demand data.
(2) Performance on Dataset 2 (Antai Cup International E-Commerce Challenge Dataset)
On Dataset 2 (Antai Cup), which features more complex regional demand variations and longer time-series sequences, PAS remains highly competitive. The PAS algorithm achieves an MAE of 26.85; although it does not surpass TFT on this metric, it ranks second-best overall, demonstrating strong performance when all models are considered comprehensively. Additionally, PAS demonstrates an 11.8% improvement in RMSE (29.73 vs. 33.71), further underlining its superior ability to model demand fluctuations in diverse cross-border scenarios.
To investigate metaheuristic optimization effects, we also evaluate GA- and SA-optimized Transformers on Dataset 2. GA-Transformer achieves an MAE of 27.12 and RMSE of 31.05, while SA-Transformer achieves an MAE of 27.38 and RMSE of 31.42. These results indicate that GA and SA moderately improve standard Transformer performance (by roughly 5–6% in MAE and 10–12% in RMSE), yet PAS still outperforms them, demonstrating its superior capability in handling complex, long-term demand patterns.
PAS’s $R^2$ value on Dataset 2 increases to 0.931, which further emphasizes its effectiveness in capturing complex demand patterns caused by factors such as regional promotions, currency fluctuations, and other cross-border influences. The GA- and SA-optimized Transformers achieve $R^2$ values of 0.908 and 0.905, respectively, lower than PAS, confirming its advantage in explaining variance in actual demand data. The strong performance of PAS in MAE and RMSE indicates its robustness in handling diverse demand behaviors and its ability to generate highly accurate predictions, even in datasets with more challenging characteristics.
In summary, the quantitative results from both datasets demonstrate that PAS delivers the best overall performance, leading on nearly every evaluation metric and trailing only TFT on the Dataset 2 MAE. Even compared to the GA- and SA-optimized Transformers, PAS achieves the lowest MAE and RMSE and the highest $R^2$, highlighting its robustness and effectiveness in global cross-border e-commerce demand forecasting. PAS excels in reducing forecasting errors, whether measured by MAE, RMSE, or percentage-based metrics such as MAPE and SMAPE. Furthermore, its superior $R^2$ value indicates that PAS captures and explains a greater proportion of the variance in actual demand data than the other models. These results demonstrate that PAS is well suited for real-world demand prediction scenarios in global e-commerce, where fluctuations are frequent and driven by diverse influences.

6.1.2. Visual Performance Comparison

Figure 3 presents a comprehensive comparison of forecasting performance for all considered models on two real-world cross-border e-commerce datasets: Dataset 1 (Kaggle) and Dataset 2 (Antai Cup). The left axis corresponds to the absolute error indicators, MAE and RMSE, and the right axis illustrates the $R^2$ metric, which represents the explained variance for each model. Solid lines correspond to Dataset 1, and dashed lines correspond to Dataset 2. It can be observed that traditional models such as ARIMA and XGBoost exhibit relatively high MAE and RMSE values, reflecting their limited ability to capture nonlinear demand patterns. Deep learning models, including LSTM and Transformer, reduce errors substantially, while hybrid models such as TFT and MOT further improve performance by combining attention mechanisms and metaheuristic optimization.
Across all compared models, PAS delivers the smallest MAE and RMSE on both datasets, indicating stronger prediction accuracy and greater stability. The right axis shows that PAS also attains the highest $R^2$ values, indicating that it captures the underlying demand trends more effectively than the other models. This dual-axis representation clearly highlights PAS’s ability not only to minimize forecast errors but also to explain a larger proportion of variance in the actual demand, providing strong evidence of its suitability for complex cross-border e-commerce demand forecasting scenarios.
Figure 4 presents a visual comparison between the actual demand and the PAS model predictions for both datasets. In this figure, the horizontal axis represents discrete time steps corresponding to the sequence of historical demand observations, while the vertical axis denotes the demand values for the corresponding products or regions. Solid lines with markers indicate the true demand, and dashed lines with markers represent the predicted demand from the PAS framework.
For Dataset 1 (Kaggle Cross-Border E-Commerce), the PAS predictions closely follow the real demand fluctuations across all time steps. For example, at time step 3, the actual demand reaches 150 units, while the predicted value is 148 units, demonstrating a very small deviation. Similarly, at time step 7, the actual demand is 170 units and the predicted demand is 172 units, illustrating the model’s ability to track sharp upward trends accurately. Overall, the predicted curve almost overlaps with the actual curve, indicating that PAS captures both the amplitude and direction of demand changes effectively.
For Dataset 2 (Antai Cup), which exhibits more complex temporal and regional variations, the model similarly demonstrates strong predictive performance. At time step 5, the actual demand is 230 units, with the PAS prediction at 228 units, while at time step 6, the actual and predicted values are 225 and 227 units, respectively. These examples show that even in scenarios with more irregular fluctuations, PAS is able to maintain accurate forecasting and closely track the true demand trend.
Overall, this figure highlights that the PAS framework not only minimizes pointwise forecasting errors but also successfully captures the temporal patterns and underlying trends in cross-border e-commerce demand, reinforcing its effectiveness and robustness in real-world applications.

6.2. Multi-Horizon Forecasting Results

In addition to the overall forecasting performance, we further assess the PAS framework under multi-horizon forecasting scenarios, where predictions are made for multiple future time steps, including 1-day, 3-day, 7-day, and 14-day ahead horizons. Multi-horizon forecasting is particularly challenging in cross-border e-commerce, as demand often exhibits irregular fluctuations due to factors such as regional promotions, holidays, logistics delays, and currency exchange rate variations. Accurate multi-step predictions require models to capture both short-term dynamics and longer-term trends simultaneously. In this context, we evaluate PAS against several representative baseline models, including ARIMA, LSTM, Transformer, and TFT. To evaluate prediction quality over multiple horizons, MAE and RMSE serve as the principal metrics for accuracy and robustness.
Table 7 presents the MAE and RMSE values of all models across four forecasting horizons on Dataset 1 (Kaggle). PAS demonstrates the strongest overall performance, achieving the lowest errors for 1-day MAE/RMSE and 3-day MAE. For longer horizons, PAS remains competitive: at 3-day RMSE, TFT slightly outperforms PAS, while at 7-day MAE, TFT again has a minor edge. At the 14-day horizon, Transformer achieves a slightly lower MAE than PAS. Despite these minor differences, PAS consistently maintains low error across all horizons, reflecting its balanced capability to capture both short-term fluctuations and long-term trends. Overall, when considering all horizons together, PAS achieves the most robust and reliable forecasting performance among the compared models.
The results also indicate that deep learning models such as LSTM and Transformer are more effective than ARIMA for multi-horizon forecasting, but they tend to accumulate errors as the horizon lengthens. In contrast, PAS mitigates this error accumulation by leveraging its multi-stage optimization strategy, which balances global trend learning with local fluctuation adaptation. This explains why PAS shows a steadily increasing advantage over longer horizons. Additionally, the RMSE trends mirror the MAE patterns, confirming that PAS reduces both small and large prediction errors, thereby providing more reliable forecasts for operational decision-making in cross-border e-commerce. Overall, the multi-horizon analysis confirms that PAS is well suited for real-world scenarios where businesses need accurate demand forecasts over both short and long time frames.
Figure 5 presents the MAE and RMSE trends of all models across four forecast horizons (1, 3, 7, and 14 days ahead). PAS consistently achieves the lowest error values for both MAE (solid lines) and RMSE (dashed lines), showing that it can accurately track both rapid changes and sustained trends in international e-commerce demand. Compared to baseline models, ARIMA and LSTM show a rapid increase in errors with horizon length, while Transformer and TFT exhibit moderate degradation, yet all remain above PAS, highlighting the robustness of PAS across different prediction horizons.
Notably, PAS maintains a stable margin of improvement over the other models, particularly at longer horizons (7 and 14 days), where accurate forecasting is most challenging due to irregular demand patterns, promotions, and regional variations. The plotted results corroborate the performance metrics in Table 7, emphasizing that PAS effectively balances trend capture and volatility adaptation, yielding more reliable and accurate multi-step forecasts in real-world cross-border e-commerce scenarios.

6.3. Ablation Study

To quantitatively analyze the role of each critical element in the PAS framework, we conduct an ablation study on both Dataset 1 (Kaggle Cross-Border E-Commerce Dataset) and Dataset 2 (Antai Cup International E-Commerce Challenge Dataset). Four variants of the PAS framework are evaluated to isolate the contribution of each component. The full PAS model includes PSO optimization, the multi-stage search strategy, and the attention mechanism. In the variant without the improved PSO, all parameters are optimized purely through gradient descent, removing the global search capability. The version without multi-stage search replaces the adaptive stage-wise PSO process with a standard single-stage PSO, limiting the model’s ability to progressively refine solutions. Finally, the variant without the attention mechanism disables the attention layers, effectively reducing the model to a standard Transformer with PSO optimization only.
The performance of each variant is evaluated using four metrics, MAE, RMSE, MAPE, and SMAPE, providing a comprehensive understanding of both absolute and relative forecasting accuracy. From Table 8, several observations can be made regarding the contributions of each component in the PAS framework:
  • Impact of PSO Optimization: Removing the PSO module results in the largest degradation. On Dataset 1, RMSE increases from 31.47 to 36.80 and MAPE rises from 8.2% to 9.3%. On Dataset 2, RMSE increases from 29.73 to 34.85 and MAPE from 7.5% to 9.0%. This demonstrates that PSO is crucial for reducing large forecasting deviations.
  • Role of Multi-Stage Search: Disabling the multi-stage search causes moderate performance decline. MAE rises from 25.18 to 27.62 on Dataset 1 and from 23.85 to 26.95 on Dataset 2, while SMAPE increases from 7.6% to 8.2% and from 6.9% to 7.9%, respectively. This indicates that the adaptive stage-wise search effectively captures local demand fluctuations.
  • Contribution of Attention Mechanism: Without attention, the model’s errors increase slightly. On Dataset 1, MAE grows from 25.18 to 26.95 and SMAPE from 7.6% to 7.9%; on Dataset 2, MAE changes from 23.85 to 25.70 and SMAPE from 6.9% to 7.2%. This suggests that the attention mechanism mainly contributes to learning global temporal patterns and trend fidelity.
  • Synergistic Effect of Full PAS Model: The fully implemented PAS model attains the best outcomes on all metrics for both datasets. This confirms that the combined use of PSO, multi-stage search, and attention mechanism provides a balanced and robust solution for accurate cross-border demand forecasting.
Figure 6 illustrates the ablation study results of the PAS framework on Dataset 1 (Kaggle, solid lines) and Dataset 2 (Antai Cup, dashed lines) across four metrics: MAE, RMSE, MAPE, and SMAPE. As shown, removing any component consistently degrades performance compared to the full PAS model.
Specifically, the PSO optimization module contributes most significantly: on Dataset 1, MAE increases from 25.18 to 28.01 and RMSE from 31.47 to 36.80 when PSO is removed; similarly, on Dataset 2, MAE rises from 23.85 to 27.50 and RMSE from 29.73 to 34.85. Disabling the multi-stage search leads to moderate performance drops, with MAE increasing to 27.62 (D1) and 26.95 (D2), while SMAPE rises to 8.2% (D1) and 7.9% (D2). Removing the attention mechanism has a smaller yet noticeable effect: MAE grows to 26.95 (D1) and 25.70 (D2), and SMAPE to 7.9% (D1) and 7.2% (D2).
The full PAS model achieves the lowest errors across all metrics and datasets, demonstrating the synergistic effect of PSO, multi-stage search, and attention mechanism, as well as its robustness in capturing both local fluctuations and global temporal patterns in cross-border demand forecasting.

6.4. Convergence and Stability Analysis

This section examines the convergence behavior and stability of the PAS framework relative to the baseline models. Convergence refers to the rate at which a model approaches optimal performance during training, while stability refers to the consistency of model performance across different runs. These factors are crucial in dynamic environments such as cross-border e-commerce, where demand patterns can be irregular and data can change over time.
We evaluate the convergence of PAS and baseline models by tracking their performance (in terms of MAE and RMSE) over multiple training epochs on both Dataset 1 (Kaggle Cross-Border E-Commerce Dataset) and Dataset 2 (Antai Cup International E-Commerce Challenge Dataset). Stability is assessed by measuring the variance in performance across multiple runs with different random initializations.
Table 9 shows that PAS reaches the lowest final MAE and RMSE values across the two datasets. Moreover, PAS exhibits the smallest variance in performance across multiple runs, indicating its high stability. The removal of PSO optimization leads to a notable increase in both error metrics and variance, emphasizing the critical role of PSO in stabilizing model predictions. Similarly, the multi-stage search and attention mechanisms contribute to both faster convergence and lower variance, with their absence resulting in relatively higher errors and variability.
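The stability protocol described above (variance of final errors across differently seeded runs) can be sketched as follows. The `mock_train_and_eval` function is a hypothetical stand-in for a full PAS training run, returning an MAE centered on the value reported in Table 9; it is an illustration of the measurement procedure, not the actual pipeline.

```python
import numpy as np

def run_stability_trials(train_and_eval, seeds):
    """Run the same training/evaluation pipeline under several random
    seeds and summarize the spread of the resulting error metric."""
    maes = [train_and_eval(seed) for seed in seeds]
    return float(np.mean(maes)), float(np.std(maes))

# Illustrative stand-in for a full training run: a noisy evaluation
# whose outcome depends only on the random seed (values are synthetic,
# loosely centered on the Dataset 1 MAE reported in Table 9).
def mock_train_and_eval(seed):
    rng = np.random.default_rng(seed)
    return 25.18 + rng.normal(0.0, 0.48)

mean_mae, std_mae = run_stability_trials(mock_train_and_eval, seeds=range(10))
```

In practice, `train_and_eval` would retrain the model from scratch under each seed; the reported standard deviation is then directly comparable across model variants.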

6.5. Computational Efficiency and Complexity Analysis

We assess the practical applicability of PAS to real-world cross-border e-commerce by analyzing its theoretical complexity and empirical efficiency, focusing on training time, inference latency, and GPU memory usage. All experiments were conducted under the environment described in Section 5.2 to ensure a fair comparison.

6.5.1. Theoretical Complexity

The computational complexity of a standard Transformer is O(L²d), where L denotes the sequence length and d the feature dimension; this cost is dominated by the self-attention mechanism. In the PAS framework, the PSO-based optimization is applied only to a subset of model parameters Θ_p (e.g., attention projection matrices and key Transformer weights), rather than to the full parameter set Θ. In our implementation, the dimensionality of Θ_p is less than 15% of that of the full model, which limits the additional optimization overhead.
The extra computational complexity introduced by PSO is O(N·M·D_p), where N is the number of particles, M the number of iterations, and D_p the dimension of Θ_p. Since D_p is relatively small and the proposed multi-stage search strategy accelerates convergence, the overall computational cost remains controlled; the total complexity of PAS is therefore that of Transformer training plus a lightweight global optimization process. Regarding space complexity, PAS introduces no additional large-scale tensors during inference: PSO operates only during training and only on parameter vectors, so its memory overhead relative to the Transformer baseline is negligible.
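To make the O(N·M·D_p) overhead concrete, the following is a minimal sketch of PSO restricted to a low-dimensional parameter subset. It is a generic textbook PSO on a toy sphere objective, not the paper's improved variant (no optimal-point initialization or Lévy flights); the particle count, iteration count, and dimensionality are illustrative, while the inertia and acceleration values follow Table 3.

```python
import numpy as np

def pso_subset(objective, dim, n_particles=20, iters=50, seed=0):
    """Minimal PSO over a small parameter subset. Each iteration does
    O(n_particles * dim) work, so total cost is O(N * M * D_p)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.apply_along_axis(objective, 1, pos)
    gbest = pbest[pbest_val.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration (Table 3 values)
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.apply_along_axis(objective, 1, pos)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, float(pbest_val.min())

# Hypothetical low-dimensional surrogate for the tuned subset Θ_p.
best, best_val = pso_subset(lambda x: float(np.sum(x ** 2)), dim=5)
```

Because only `dim` (the stand-in for D_p) enters each particle update, shrinking the optimized subset shrinks the overhead proportionally, which is the argument made above.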

6.5.2. Training and Inference Efficiency

To quantitatively assess computational efficiency, we compare PAS with several baseline models in terms of training time per epoch, total training time until convergence, inference latency per sample, and peak GPU memory usage. The results are summarized in Table 10.
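A hedged sketch of how per-sample inference latency can be measured; the `predict` function here is an illustrative stand-in (a plain matrix product), not any of the compared models, and the warmup/repeat counts are assumptions. In practice each model's forward pass would be timed with the same harness.

```python
import time
import numpy as np

def measure_latency(predict, batch, n_warmup=3, n_runs=20):
    """Median per-sample inference latency in milliseconds."""
    for _ in range(n_warmup):          # warm caches before timing
        predict(batch)
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        predict(batch)
        times.append(time.perf_counter() - t0)
    return 1000.0 * float(np.median(times)) / len(batch)

# Hypothetical stand-in for a forecasting model's forward pass.
weights = np.random.default_rng(0).normal(size=(64, 1))
batch = np.random.default_rng(1).normal(size=(128, 64))
latency_ms = measure_latency(lambda x: x @ weights, batch)
```

Using the median rather than the mean makes the estimate robust to occasional scheduler stalls; peak GPU memory would be read separately (e.g., via the framework's memory counters) rather than timed.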
As shown in Table 10, traditional statistical models such as ARIMA and machine learning methods like XGBoost achieve the lowest computational cost but relatively limited forecasting accuracy, while deep learning models improve performance at the expense of higher training overhead. Compared with the standard Transformer, PAS introduces moderate additional training cost due to PSO-based global optimization and bilevel learning, but this overhead is lower than that of GA- and SA-optimized Transformers, which require more extensive search iterations. Importantly, PSO is applied only during training, so PAS inference latency remains nearly identical to the standard Transformer, enabling real-time deployment.
In terms of memory consumption, PAS uses slightly more GPU resources than Transformer but stays comparable to other hybrid models such as MOT, indicating limited memory overhead. Overall, PAS achieves a favorable trade-off between forecasting accuracy and computational efficiency, and its optimal-point initialization with adaptive multi-stage search reduces redundant exploration and accelerates convergence, confirming its practicality and scalability for real-world cross-border e-commerce demand forecasting.

6.6. Attention Analysis and Robustness Discussion

To further understand the effectiveness of the PAS framework, we analyze the learned attention weights and examine the robustness of the model under different noise levels and feature perturbations. Attention visualization provides insights into which historical periods and input features the model prioritizes when forecasting demand, while robustness tests evaluate the stability of predictions under uncertain or incomplete data conditions.
(1) Attention Weight Analysis
Figure 7 shows a radar chart of average attention scores across key feature groups: historical sales, promotions, holidays, regional indicators, and user interactions. The results indicate that PAS consistently assigns higher attention to recent sales trends and promotional events, highlighting its ability to capture key demand drivers. Features such as holidays and regional indicators receive moderate attention, reflecting their secondary yet meaningful influence. The attention distribution confirms that PSO-guided learning effectively adjusts the importance of different features based on their contribution to prediction accuracy.
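The per-group aggregation behind such a chart can be sketched as follows. The attention matrix, feature indices, and group names below are synthetic placeholders, not the learned values plotted in Figure 7.

```python
import numpy as np

def average_attention_by_group(attn, groups):
    """Average attention weights over the feature indices of each group.
    `attn` has shape (n_heads, n_features); returns {group: mean score}."""
    return {name: float(attn[:, idx].mean()) for name, idx in groups.items()}

# Hypothetical attention matrix (4 heads, 10 features) and index groups
# mirroring the feature families discussed in the text.
rng = np.random.default_rng(0)
attn = rng.random((4, 10))
attn /= attn.sum(axis=1, keepdims=True)   # normalize per head
groups = {
    "historical_sales": [0, 1, 2],
    "promotions": [3, 4],
    "holidays": [5],
    "regional": [6, 7],
    "user_interactions": [8, 9],
}
scores = average_attention_by_group(attn, groups)
```

Averaging over heads (and, in practice, over time steps and test samples) yields one scalar per feature group, which is what the radar chart visualizes.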
(2) Robustness under Feature Perturbation
To evaluate robustness, we add Gaussian noise with standard deviations σ ∈ {0.01, 0.05, 0.1} to the input features and measure the relative change in MAE and RMSE. Table 11 summarizes the results: PAS maintains relatively stable performance even under moderate noise, with MAE increasing by only 3.2% at σ = 0.05. In contrast, baseline models such as Transformer and LSTM degrade more sharply, highlighting PAS's superior resilience to input uncertainty.
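This perturbation protocol can be reproduced with a short sketch. The linear forecaster and data below are synthetic stand-ins for the compared models (PAS itself is not reproduced here); the σ = 0 call serves as a sanity check that the degradation metric is exactly zero without noise.

```python
import numpy as np

def perturbation_degradation(predict, X, y, sigma, seed=0):
    """Relative MAE increase (%) when i.i.d. Gaussian noise N(0, sigma^2)
    is added to the input features."""
    rng = np.random.default_rng(seed)
    mae_clean = np.abs(predict(X) - y).mean()
    X_noisy = X + rng.normal(0.0, sigma, X.shape)
    mae_noisy = np.abs(predict(X_noisy) - y).mean()
    return 100.0 * (mae_noisy - mae_clean) / mae_clean

# Hypothetical linear forecaster on synthetic standardized features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
w = rng.normal(size=8)
y = X @ w + rng.normal(0.0, 0.5, size=200)   # noisy targets
deg_005 = perturbation_degradation(lambda Z: Z @ w, X, y, sigma=0.05)
deg_zero = perturbation_degradation(lambda Z: Z @ w, X, y, sigma=0.0)
```

Running the same function over each model and each σ produces exactly the grid of relative degradations summarized in Table 11.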
(3) Discussion
The attention analysis reveals that PAS successfully identifies and emphasizes the most relevant features for demand prediction, such as recent sales and promotions, while also capturing secondary influences like holidays and regional indicators. The robustness experiments demonstrate that PAS is less sensitive to input noise and perturbations compared to conventional deep learning models, suggesting that the integration of PSO-guided attention optimization enhances both model interpretability and stability. This combination of explainable attention and strong robustness makes PAS well suited for real-world cross-border e-commerce scenarios, where data may be noisy or incomplete.
Several relevant research directions lie outside the scope of the current study and remain for future exploration. First, this work focuses on improved PSO for hyperparameter optimization and does not conduct a systematic investigation of alternative optimization methods, including Bayesian optimization, reinforcement learning-based tuning, or other advanced evolutionary and swarm-based algorithms. A comprehensive benchmarking of these strategies would further clarify the relative merits of the proposed PSO variant. Second, PAS is designed as a time-series–driven forecasting model built on the Transformer architecture.
The integration of improved PSO with modeling principles beyond time series analysis—including systems of differential equations, agent-based models (ABM), and alternative deep neural networks (DNN), such as PSO-ABM and PSO-DNN hybrids—has not been explored in this work. Such combinations could expand the applicability of enhanced PSO approaches to more complex dynamic systems and heterogeneous data environments. Future work will examine these extended hybrid frameworks and diverse optimization strategies to further advance demand forecasting methodology in cross-border e-commerce.

7. Conclusions

This research develops the PAS framework to enhance demand prediction in global e-commerce, leveraging PSO feature selection and an attention-oriented prediction model.
  • Effective Feature Selection: PAS emphasizes critical demand drivers such as recent sales trends and promotional events. Experimental results show that PAS achieves an average MAPE of 8.7%, outperforming Transformer (11.3%), LSTM (12.5%), and metaheuristic-optimized variants such as GA-Transformer (9.1%) and SA-Transformer (9.2%), indicating a significant improvement in prediction accuracy.
  • Attention-based Interpretability: The attention mechanism reveals that Sales and Promotions contribute 83% and 78% of the overall attention, respectively, while holidays, regional indicators, and user interactions receive moderate attention (42–60%). This validates PAS’s capability to focus on key drivers of cross-border e-commerce demand.
  • Robust Performance Across Regions: PAS maintains stable performance across different product categories and regions, achieving an average RMSE reduction of 15% compared to baseline models, and demonstrates superior robustness even against GA- and SA-optimized Transformer models, highlighting its generalization ability.
Despite its advantages, the current study has several limitations: (i) PAS primarily relies on structured historical and promotional data, which limits its ability to fully exploit unstructured information sources such as textual product reviews or social media trends. (ii) Although attention scores provide interpretability, they may not fully capture complex feature interactions, particularly in highly dynamic market conditions. (iii) The model requires careful hyperparameter tuning, including PSO optimization parameters and network configurations, which may reduce scalability and generalizability across diverse datasets.
For future work, several directions are promising. First, integrating unstructured data sources, such as customer reviews or social media sentiment, could further enhance prediction accuracy. Second, extending the attention mechanism to capture temporal and hierarchical dependencies may improve interpretability and robustness. Third, incorporating advanced recurrent architectures such as bidirectional long short-term memory (Bi-LSTM) networks into the PAS framework could further improve the modeling of bidirectional temporal dependencies, especially for offline or batch forecasting scenarios. Finally, developing automated hyperparameter optimization and online learning strategies could make PAS more scalable and adaptive, enabling real-time demand prediction for large-scale cross-border e-commerce applications.

Author Contributions

Conceptualization: H.H., J.C. and C.X.; Methodology: H.H. and J.C.; Software: H.H.; Formal analysis: J.C.; Investigation: H.H.; Resources: C.X.; Data curation: J.C.; Writing—original draft preparation: H.H. and J.C.; Writing—review and editing: H.H., J.C. and C.X.; Visualization: H.H.; Supervision: C.X.; Project administration: C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We evaluated our cross-border e-commerce demand forecasting model using two publicly available datasets: the Kaggle Cross-Border E-Commerce Dataset, containing multi-platform user sessions and sales records (https://www.kaggle.com/datasets/programmer3/cross-border-e-commerce-dataset, accessed on 8 October 2025), and the Antai Cup International E-Commerce Challenge dataset, comprising detailed user purchase histories and product attributes (https://tianchi.aliyun.com/dataset/29170, accessed on 10 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. PAS framework: analytic hierarchy process for demand forecasting in cross-border e-commerce.
Figure 2. Improved particle swarm optimization algorithm flow framework.
Figure 3. Comparison of forecasting performance (MAE, RMSE, R²) for Dataset 1 (Kaggle) and Dataset 2 (Antai Cup).
Figure 4. Comparison of PAS predicted demand with actual demand for Dataset 1 (Kaggle) and Dataset 2 (Antai Cup). Solid lines with markers denote actual demand; dashed lines with markers denote predicted demand.
Figure 5. Multi-horizon forecasting performance (MAE and RMSE). Solid lines indicate MAE, dashed lines indicate RMSE.
Figure 6. Ablation study results of PAS framework on Dataset 1 (solid lines) and Dataset 2 (dashed lines) across four metrics (MAE, RMSE, MAPE, SMAPE). This line chart highlights the contribution of each component and allows direct comparison between datasets.
Figure 7. Enhanced attention score comparison across key feature groups. Gradient 3D-style bars, subtle background, and data labels improve visual impact. PAS shows stronger attention on Sales and Promotions.
Table 1. Descriptive statistics, missing values, and sequence characteristics of Kaggle and Antai Cup datasets.
Dataset Mean Std Median Max Min CV
Kaggle 128.46 62.31 119.23 238.75 12.61 0.485
Antai Cup 143.72 75.18 132.58 256.30 9.47 0.523
Dataset Demand Missing (%) Feature Missing (%) Seq. Length (days) Volatility
Kaggle 0.21 0.87–1.32 30–180 Moderate
Antai Cup 0.17 0.69–1.18 30–180 Slightly Higher
Table 2. System and software specification.
CPU: Intel Core i9-13900K, 24 cores (16P + 8E), 3.0–5.8 GHz
Memory: 64 GB DDR5 RAM
GPU: NVIDIA RTX 4090, 24 GB GDDR6X
Operating System: Ubuntu 22.04 LTS
Programming Language: Python 3.10
Deep Learning Framework: PyTorch 2.1
Data Processing Libraries: NumPy 1.25, Pandas 2.1
Optimization: Custom PSO implementation with SciPy 1.12
Reproducibility: Fixed random seed, GPU acceleration enabled
Table 3. Hyperparameter settings for PAS model.
Transformer Encoder–Decoder: number of layers 4; attention heads 8; embedding dimension 256; feed-forward dimension 512; dropout rate 0.1
PSO Optimization: population size N = 50; maximum iterations M = 200; inertia weight ω = 0.7; acceleration coefficients c1, c2, c3 = 1.5, 1.5, 0.8; Lévy flight coefficient 0.05
Training: batch size 128; learning rate (Transformer) 1 × 10⁻⁴; loss function Mean Squared Error (MSE); optimizer Adam
Table 4. Baseline model architectures and hyperparameter search ranges.
Model Layers/Depth Hidden Units/Trees Dropout/Rate Other Key Settings
ARIMA N/A (p, d, q) via AIC/BIC N/A Seasonal ARIMA if needed
XGBoost N/A 100–500 trees N/A LR 0.01–0.2, max depth 3–7
LSTM 1–2 32–128 units 0.2–0.3 Seq length same as PAS, Adam
Transformer 2–3 128 0.1–0.4 4 heads, batch size same as PAS
TFT 2–3 128 0.1–0.4 Attention + gating, Adam
MOT 2–3 128 0.1–0.4 Multi-objective optimization
Table 5. Forecasting performance comparison on Dataset 1 (Kaggle Cross-Border E-Commerce Dataset).
Model MAE RMSE MAPE (%) SMAPE (%) R²
ARIMA 58.91 67.35 22.3 20.1 0.618
XGBoost 49.38 56.72 16.8 15.2 0.735
LSTM 44.16 51.28 13.5 12.4 0.789
Transformer 30.99 39.48 9.7 8.9 0.876
GA-Transformer 29.05 35.21 9.0 8.7 0.884
SA-Transformer 29.47 35.64 9.1 8.8 0.881
TFT 28.72 36.65 9.09 8.31 0.882
MOT 27.45 34.21 8.6 7.9 0.901
PAS (Ours) 25.18 31.47 8.7 7.6 0.923
Table 6. Forecasting performance comparison on Dataset 2 (Antai Cup International E-Commerce Challenge Dataset).
Model MAE RMSE MAPE (%) SMAPE (%) R²
ARIMA 55.42 63.89 20.7 18.5 0.643
XGBoost 46.15 53.27 15.3 13.8 0.758
LSTM 41.39 48.52 12.1 11.0 0.807
Transformer 28.64 36.93 8.8 8.0 0.892
GA-Transformer 27.12 31.05 8.1 7.5 0.908
SA-Transformer 27.38 31.42 8.2 7.6 0.905
TFT 26.78 34.57 8.3 7.5 0.905
MOT 26.86 33.71 8.14 7.4 0.912
PAS (Ours) 26.85 29.73 7.5 6.9 0.931
Table 7. Multi-horizon forecasting performance comparison on Dataset 1 (Kaggle).
Model 1-Day (MAE/RMSE) 3-Day (MAE/RMSE) 7-Day (MAE/RMSE) 14-Day (MAE/RMSE)
ARIMA 28.7/35.1 31.4/39.8 37.2/45.3 42.9/53.2
LSTM 26.3/32.8 28.5/36.1 33.0/42.2 37.4/49.5
Transformer 23.8/30.1 25.7/33.6 28.5/38.9 29.8/41.0
TFT 22.9/29.3 24.6/30.0 25.8/36.2 30.1/40.5
PAS (Ours) 21.0/27.4 22.3/30.3 26.2/35.6 30.5/40.0
Table 8. Ablation study results on Dataset 1 and Dataset 2 across multiple metrics.
Model Variant Dataset 1 (Kaggle) Dataset 2 (Antai Cup)
(columns per dataset: MAE, RMSE, MAPE (%), SMAPE (%))
PAS (Full Model) 25.18 31.47 8.2 7.6 23.85 29.73 7.5 6.9
w/o PSO 28.01 36.80 9.3 8.7 27.50 34.85 9.0 8.4
w/o Multi-Stage Search 27.62 35.21 8.9 8.2 26.95 33.10 8.5 7.9
w/o Attention Mechanism 26.95 33.88 8.5 7.9 25.70 31.85 7.9 7.2
Table 9. Convergence and stability analysis results on Dataset 1 and Dataset 2.
Final errors (MAE/RMSE): Dataset 1 (Kaggle), Dataset 2 (Antai Cup)
PAS (Full Model) 25.18/31.47 23.85/29.73
w/o Improved PSO 27.30/34.75 26.00/32.90
w/o Multi-Stage Search 26.50/33.50 25.50/31.20
w/o Attention Mechanism 26.85/33.85 25.80/31.95
Standard deviation across runs (MAE/RMSE): Dataset 1, Dataset 2
PAS (Full Model) 0.48/0.56 0.51/0.62
w/o Improved PSO 1.01/1.18 0.95/1.10
w/o Multi-Stage Search 0.91/1.04 0.85/0.95
w/o Attention Mechanism 0.93/1.06 0.88/1.01
Table 10. Computational efficiency comparison of PAS and baseline models.
Model Training Time/Epoch (s) Total Training Time (min) Inference Latency (ms) Peak GPU Memory (MB)
ARIMA 0.8 2.1 0.5 680
XGBoost 1.2 3.5 0.8 920
LSTM 3.6 18.6 2.1 2845
Transformer 4.2 22.3 2.6 4128
GA-Transformer 6.8 41.5 2.7 4300
SA-Transformer 7.0 43.2 2.7 4350
TFT 5.1 29.7 2.9 4560
MOT 5.5 32.4 2.8 4480
PAS (Ours) 5.8 34.6 2.7 4350
Table 11. Robustness evaluation under feature perturbation (Dataset 1, MAE/RMSE).
Model σ = 0.01 (MAE/RMSE) σ = 0.05 (MAE/RMSE) σ = 0.1 (MAE/RMSE)
LSTM 45.2/52.1 48.7/56.4 53.9/62.1
Transformer 31.5/39.9 34.2/42.5 38.0/47.1
TFT 29.1/37.1 31.6/40.3 35.2/44.6
PAS (Ours) 25.4/31.8 26.2/32.7 27.8/34.5
