4.1. Experimental Setup
To ensure that the evaluation results are broadly representative and practically meaningful, the experiment first constructs simulated urban scenarios encompassing diverse dimensional features. Regarding problem scale, four test gradients (S1–S4) with strictly increasing difficulty are designed; the combinations of demand point quantities and planning cycles are configured as shown in Table 3. These scale gradients are explicitly defined to mirror real-world administrative hierarchies: S1 represents community-level micro-planning, S2 corresponds to district-level scheduling, while S3 and S4 simulate complex city-wide and metropolitan logistics networks, respectively. These node configurations are consistent with standard benchmarks in the recent large-scale WEEE logistics literature [25,36], ensuring the evaluation covers the full spectrum of spatial complexity, from community-level short-term micro-planning to city-level long-term macro-scheduling, and rigorously tests the stability of the model under varying spatiotemporal complexity.
In terms of spatial distribution characteristics and network topology construction, a hierarchical design strategy is adopted to simulate the clustering features of population and commercial activities in real cities. First, to model the distribution of demand points, the experiment employs a GMM to construct 3–5 random hotspot areas or a pronounced polycentric distribution within a unit square. This approach simulates the population density variations and spatial heterogeneity ranging from core commercial districts to suburban areas. Furthermore, to reflect the rational characteristics of logistics facility planning, the generation of the candidate site set I abandons simple random point distribution in favor of a clustering-based strategy. Specifically, for the generated non-uniform demand distribution, the K-Means algorithm is applied to extract cluster centroids as the base coordinates for candidate sites. This strategy ensures that candidate facilities are inherently located at the geometric centers of high-density demand areas, representing optimal logistics nodes with the advantage of minimizing potential transportation costs, thereby constructing a spatial topology with high practical relevance.
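The two-stage spatial generation described above (hotspot-based demand sampling followed by K-Means extraction of candidate sites) can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the hotspot count, spread, and a hand-rolled Lloyd's iteration stand in for the unspecified GMM and K-Means configurations.

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_demand_points(n_points=200, n_hotspots=4, spread=0.06):
    """Sample demand points from a mixture of Gaussian hotspots inside
    the unit square, mimicking a polycentric urban demand distribution."""
    centers = rng.uniform(0.15, 0.85, size=(n_hotspots, 2))
    labels = rng.integers(0, n_hotspots, size=n_points)
    pts = centers[labels] + rng.normal(0.0, spread, size=(n_points, 2))
    return np.clip(pts, 0.0, 1.0)

def kmeans_centroids(points, k=15, iters=50):
    """Plain Lloyd's algorithm; the k centroids serve as base coordinates
    for candidate facilities at the centres of high-density demand areas."""
    cent = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None, :] - cent[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                cent[j] = members.mean(axis=0)
    return cent

demand = generate_demand_points()
candidates = kmeans_centroids(demand)
```

In a library setting, `sklearn.mixture.GaussianMixture` and `sklearn.cluster.KMeans` would replace the hand-rolled pieces; the point of the strategy is that candidate sites land at the geometric centers of dense demand clusters.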
To accurately simulate the time-varying and non-stationary nature of WEEE generation in real-world scenarios, this study forgoes traditional static or purely random demand assumptions. Instead, a dynamic demand generation mechanism is developed based on the decomposition of trend, seasonal, and stochastic components. Specifically, the WEEE generation volume $d_{jt}$ at each demand point $j$ during period $t$ is derived from the following mixed process:

$$d_{jt} = \bar{d}_j \left( 1 + \alpha \sin\frac{2\pi t}{T} + \beta t \right) + \epsilon_{jt}, \qquad \epsilon_{jt} \sim \mathcal{N}(0, \sigma^2).$$

In this formulation, $\bar{d}_j$ denotes the baseline demand level of the node. The dynamic nature of demand is captured through several constituent components: the sinusoidal term, with $\alpha$ modulating the seasonal intensity, simulates the periodic oscillation of electronic product disposal, reflecting cyclical fluctuations driven by replacement cycles or market activities; the linear term incorporates a growth rate $\beta$ to model the long-term upward trajectory of e-waste volume; and $\epsilon_{jt}$ is Gaussian random noise that accounts for unpredictable daily fluctuations.
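A minimal generator for this trend/seasonality/noise process might look like the following; the coefficient values (seasonal amplitude, growth rate, noise scale) are illustrative assumptions, and negative draws are truncated at zero since waste volumes cannot be negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def demand_series(base, T, alpha=0.3, beta=0.01, sigma=0.05):
    """d_t = base * (1 + alpha*sin(2*pi*t/T) + beta*t) + N(0, sigma^2):
    baseline level, seasonal oscillation, linear growth, and noise."""
    t = np.arange(T)
    d = base * (1.0 + alpha * np.sin(2.0 * np.pi * t / T) + beta * t)
    d += rng.normal(0.0, sigma, size=T)
    return np.maximum(d, 0.0)  # waste volumes cannot go negative

d = demand_series(base=10.0, T=12)
```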
This design not only aligns the testing environment with the fluctuation characteristics of real-world supply chains but, more pivotally, it also constructs a data environment with significant time-series dependency features. This allows for the effective verification of the LSTM module’s capacity within the framework to capture long-term dependencies and non-linear spatiotemporal patterns, determining whether the model has truly acquired the adaptive intelligence required for dynamic demand-based allocation.
To evaluate the total operational expenditures over multiple periods and assess the economic feasibility of the facility location configuration, the experiment establishes a set of benchmark operational parameters. These values are calibrated with reference to the extant literature on WEEE reverse logistics [
37] and empirical survey data. Specifically, the site construction cost
is modeled as a fixed cost to reflect the initial investment in land and infrastructure [
38], while the unit transportation cost
is defined as a linear function of the Euclidean distance. In particular, to evaluate algorithmic performance within resource-constrained environments, the facility capacity
is modeled as a dynamic constraint correlated with the regional average demand [
36]. This configuration necessitates a strategic trade-off between load distribution across multiple facilities and centralized processing at fewer facilities [
39]. The comprehensive baseline parameter settings are summarized in
Table 4.
The operational parameters in
Table 4 were calibrated based on standard values in established WEEE literature [
37,
38] to ensure economic realism and comparability. For the deep learning hyperparameters in
Table 5, initial values were determined via a preliminary grid search within ranges recommended by recent studies. The optimality of these selected key parameters is further rigorously verified through the sensitivity analysis presented in
Section 4.2.
The training and testing of the deep neural networks were conducted on a high-performance computing platform equipped with an NVIDIA GeForce RTX 4090 GPU (24 GB) and CUDA 12.8. The detailed hyperparameter configurations for model training are presented in
Table 5.
This section selects three categories of baseline methods for comparative analysis to establish a rigorous and objective performance evaluation framework. First, exact solvers such as Gurobi and OR-Tools directly resolve the MILP model through the branch-and-bound method; notwithstanding their significant computational requirements, they provide ground-truth optimal solutions against which the optimality gap is measured. Second, metaheuristic algorithms including GA and PSO are incorporated as representative mainstream benchmarks in the engineering domain for addressing such NP-hard problems. Finally, pure reinforcement learning baselines such as DQN and PPO are utilized to represent end-to-end learning paradigms that do not incorporate hybrid search mechanisms, thereby verifying the necessity of the hybrid optimization strategy proposed in this study.
To quantify model performance across multiple dimensions and address the core scientific questions posed earlier, this section establishes a comprehensive evaluation framework encompassing economic feasibility, operational efficiency, and computational characteristics. Within the fundamental economic dimension, Total Expected Cost (TEC) is utilized as the primary metric for evaluating the viability of the proposed schemes. This is supplemented by a granular breakdown and statistical analysis of construction, operation, transportation, and penalty costs to rigorously examine the rationality of the cost structure.
To further investigate the model's trade-off mechanism across multi-dimensional objectives, this study examines the coupling between service levels and resource-allocation efficiency. Specifically, the Average Service Rate (ASR) is introduced as a core indicator quantifying the social benefits of the recycling network. It is defined as the proportion of actually collected volume relative to the total dynamic demand within the planning horizon:

$$\text{ASR} = 1 - \frac{\sum_{t=1}^{T}\sum_{j} u_{jt}}{\sum_{t=1}^{T}\sum_{j} d_{jt}},$$

where $u_{jt}$ represents the unmet demand determined after decision-making; the metric intuitively reflects the system's responsiveness to time-varying demand and its service breadth. Complementary to this is Capacity Utilization (CU), which monitors the actual load levels of active stations. This allows the model to be scrutinized for whether it has truly acquired "on-demand allocation" dynamic adjustment strategies, effectively avoiding resource waste caused by idle facilities and service bottlenecks caused by overloads. Finally, given the stringent real-time requirements of large-scale dynamic scheduling, Average Decision Time is used as the key criterion for computational efficiency; combined with the cumulative reward convergence trajectory during DRL training, it verifies the engineering suitability of the LAtt-PR framework from the perspectives of timeliness and learning stability.
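The ASR and CU indicators can be computed directly from the decision outputs; the array shapes below (periods by demand points, per-facility totals) are an assumed layout for illustration.

```python
import numpy as np

def average_service_rate(demand, unmet):
    """ASR = 1 - total unmet demand / total dynamic demand,
    with demand and unmet given as (T, J) period-by-node arrays."""
    return 1.0 - unmet.sum() / demand.sum()

def capacity_utilization(collected, capacity, active):
    """Mean load ratio over active stations only; idle (inactive)
    facilities are excluded so CU reflects the working network."""
    return float((collected[active] / capacity[active]).mean())

demand = np.array([[10.0, 12.0], [11.0, 13.0]])
unmet = np.array([[0.0, 1.0], [0.0, 0.0]])
asr = average_service_rate(demand, unmet)  # 1 - 1/46

cu = capacity_utilization(
    collected=np.array([8.0, 0.0, 9.0]),
    capacity=np.array([10.0, 10.0, 12.0]),
    active=np.array([True, False, True]),
)  # mean of 0.8 and 0.75 = 0.775
```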
4.2. Comparative Analysis of Overall Performance
Based on the experimental setup described above,
Table 6 details the comprehensive performance of each algorithm across various problem scales, encompassing three key metrics: objective function value (Obj.), relative optimality gap (Gap), and computational time (Time).
It can be clearly observed that as the problem scale expands from S1 to S4, the performance of the algorithms diverges markedly. While exact solvers such as Gurobi consistently identify the global optimal solution, their computational cost escalates exponentially with scale, reaching 10.61 s at S4; this surge in time complexity renders them unsuitable for larger instances or real-time dynamic scheduling. In contrast, traditional metaheuristics including GA and PSO encounter severe performance bottlenecks on large-scale, complex facility location problems: at S4, their optimality gaps expand sharply to over 16%, indicating that relying solely on random search in high-dimensional discrete solution spaces easily leads to local optima and hinders convergence to high-quality solutions. Pure reinforcement learning methods such as DQN and PPO, although achieving very fast solving speeds of less than 0.7 s through neural network inference, remain limited in precision due to the absence of fine-grained local search, maintaining a gap of approximately 9% in large-scale scenarios. Notably, the LAtt-PR framework demonstrates superior robustness across all tested scales. In particular, in the most challenging large-scale scenario S4, LAtt-PR maintains the optimality gap within 3.98%, an improvement of approximately 76% over GA and PSO and 55% over pure RL. Furthermore, its execution time is only 1.71 s, roughly 16% of the time required by Gurobi. These results demonstrate that LAtt-PR, by organically integrating the fast inference of DRL with the local refinement of PSO, strikes an effective balance between solving efficiency and solution quality.
From the perspective of computational sustainability, the LAtt-PR framework offers a distinct advantage over traditional exact solvers. While the offline training of the DRL agent incurs a fixed computational overhead, this is a one-time investment. In contrast, exact solvers like Gurobi rely on Branch-and-Bound algorithms with exponential time complexity, leading to prohibitive energy consumption as the network scale expands.
LAtt-PR adopts a “Train-Once-Deploy-Everywhere” paradigm. Once trained, the model executes inference in polynomial time, enabling real-time decision-making with minimal energy expenditure. For large-scale instances, LAtt-PR achieves a 6× speedup compared to Gurobi, significantly reducing the computational carbon footprint required for routine dynamic scheduling. This characteristic ensures the scalability of the system for metropolitan-level applications, aligning with the principles of sustainable computing by delivering high-quality solutions without the excessive resource consumption typical of combinatorial optimization.
The statistical significance of the observed performance disparities was rigorously evaluated using paired t-tests. As evidenced in
Table 7, the calculated p-values across all baseline comparisons consistently fall below the 0.01 threshold. These results provide sufficient statistical evidence to reject the null hypothesis, confirming that the superiority of LAtt-PR is significant at the 99% confidence level and not attributable to stochastic variance.
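As a sketch of the significance test, a paired t-statistic can be computed on per-instance cost differences (the same instances are solved by both methods, so pairing is valid). The sample values below are hypothetical, not the paper's data; in practice `scipy.stats.ttest_rel` returns the statistic and p-value in one call.

```python
import numpy as np
from math import sqrt

def paired_t(costs_a, costs_b):
    """Paired t-statistic and degrees of freedom for per-instance
    cost differences between two algorithms."""
    d = np.asarray(costs_a, dtype=float) - np.asarray(costs_b, dtype=float)
    n = len(d)
    t = d.mean() / (d.std(ddof=1) / sqrt(n))
    return t, n - 1

# Hypothetical per-instance TEC samples, for illustration only.
latt_pr = [22.4, 22.6, 22.3, 22.5, 22.4]
ga = [24.9, 25.3, 24.7, 25.1, 25.0]
t_stat, dof = paired_t(latt_pr, ga)  # strongly negative: LAtt-PR is cheaper
```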
To further demonstrate the advantages of the LAtt-PR framework in optimizing the structure of the solution space, this section systematically evaluates the performance of the algorithms in the large-scale scenarios (S3 and S4). Leveraging the statistical results in Table 8 and the visualizations in Figure 6, we comprehensively assess the operational efficiency indicators (ASR, CU) and the refined cost structure.
In the most challenging ultra-large-scale scenario S4, LAtt-PR demonstrates superior operational efficiency; as illustrated in Figure 6, its ASR curve remains consistently near 100%, mirroring the performance of Gurobi, while GA and PSO decline to approximately 94%. This performance disparity is explained by the cost composition: GA and PSO incur high penalty costs exceeding 6.5% because they myopically minimize initial construction costs (approximately 15.1%), which leads to severe supply shortages. Conversely, LAtt-PR maintains a balanced construction investment of 16.6% that effectively eliminates penalties, reducing them to only 0.3%, thereby achieving an optimal equilibrium between cost-effectiveness and service quality. This contrast indicates that while traditional metaheuristics are prone to becoming trapped in low-construction-cost local optima, LAtt-PR successfully bypasses such short-sighted strategies within high-dimensional discrete solution spaces.
The advantage of LAtt-PR is fundamentally attributed to its unique spatio-temporal attention-prediction mechanism. On the one hand, the integrated MHA effectively extracts spatial dependencies between demand points to guide facility locations toward high-density areas; on the other hand, the DRL Critic network successfully maps current construction investments to long-term service returns. This endows the agent with proactive decision-making capabilities, allowing it to assume necessary immediate costs to preemptively avoid high future penalties. By combining this forward-looking optimization with PSO’s local refinement, LAtt-PR identifies the global service-cost balance in dynamic environments rather than merely pursuing localized cost minimization.
To further validate the learning stability and convergence efficiency of the algorithm, Figure 7 illustrates the TEC trajectories of LAtt-PR and the pure RL baselines over the course of training.
LAtt-PR exhibits markedly superior convergence characteristics throughout training. Regarding convergence speed, its cost curve enters a plateau after approximately 50 training epochs, whereas PPO and DQN require roughly 100 and 125 epochs, respectively, to reach a comparable steady state; this indicates that the fine-grained local search of PSO provides high-quality, low-variance gradient signals for policy updates, effectively accelerating exploration of the policy space. In terms of stability, the cost curve of LAtt-PR shows significantly smaller fluctuations and follows a smooth downward trend, while the PPO and DQN curves display multiple pronounced oscillations and performance regressions during the mid-training phase, reflecting the policy degradation and high variance inherent in pure reinforcement-learning exploration of high-dimensional discrete action spaces.
To rigorously evaluate the model's resilience against unforeseen demand fluctuations, a sensitivity analysis was conducted by varying the intensity of the Gaussian noise term in the demand process. The results, summarized in Table 9, reveal distinct performance trajectories under escalating uncertainty levels. While all algorithms exhibit increased costs under higher volatility, LAtt-PR demonstrates superior stability, limiting the cost increment to 20.39% even under high-noise conditions. In contrast, the PPO and GA baselines are significantly more sensitive, with performance degradation rates of 30.43% and 36.16%, respectively. This disparity highlights that while metaheuristics struggle to adapt to stochastic perturbations, the proposed hybrid framework effectively leverages its LSTM module to filter high-frequency noise, ensuring robust long-term decision-making.
To validate the rationale behind the selected parameter configurations, a sensitivity analysis was conducted on three pivotal hyperparameters: the penalty coefficient for unserved waste, the learning rate of the DRL agent, and the inertia-weight strategy of the PSO module. As presented in Table 10, the selected penalty coefficient achieves the optimal equilibrium, maintaining an ASR of 98.9% while minimizing the TEC. A lower penalty fails to sufficiently penalize unmet demand, dropping the ASR to 92.1%, whereas a higher penalty forces excessive infrastructure construction, inflating the TEC by 13.8%. Regarding the learning rate, the selected value demonstrates the most stable convergence; larger rates lead to oscillation, while smaller rates suffer from slow convergence. Finally, the adaptive linear decay strategy for the PSO inertia weight yields a 2.3% cost reduction compared to a fixed-weight strategy, confirming the benefit of dynamically balancing exploration and exploitation.
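The linear inertia-weight decay for PSO is standard; a minimal version is shown below, with the common defaults w_max = 0.9 and w_min = 0.4 assumed since the exact bounds are not stated here.

```python
def inertia(epoch, total_epochs, w_max=0.9, w_min=0.4):
    """Linearly decay the PSO inertia weight from w_max to w_min,
    shifting the swarm from global exploration to local exploitation."""
    return w_max - (w_max - w_min) * epoch / max(total_epochs - 1, 1)

ws = [inertia(e, 100) for e in range(100)]  # 0.9 at the start, 0.4 at the end
```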
The aforementioned convergence analysis further validates the effectiveness of the LAtt-PR hybrid architecture: the DRL module is responsible for learning high-level state representations and policy skeletons, while the PSO module acts as an online policy refiner that continuously provides improved action samples through local search, thereby guiding policy updates toward superior directions. This synergistic mechanism not only enhances the quality of single-period decisions but also significantly mitigates common reinforcement learning issues in combinatorial optimization—such as training instability and slow convergence—by reducing the variance of policy gradient estimates, providing a reliable learning framework for efficient and robust dynamic facility location. To further provide a visual demonstration of the LAtt-PR performance, this paper visualizes the solutions generated under the S4 scale in
Figure 8; as the iterative cycles progress, the algorithm incrementally increases the number of sites while significantly reducing the average service distance, achieving a favorable balance between coverage and efficiency.
To further assess the practical applicability of the proposed framework in realistic urban environments, a case study was conducted based on the real-world topology of Zhongguancun, Beijing. By extracting geospatial data for 50 residential communities and 15 industrial zones via OpenStreetMap, we constructed a faithful representation of a metropolitan logistics network. The multi-period evolution of the optimized layout is visualized in
Figure 9.
As visualized in the figure, the spatiotemporal evolution of the facility network aligns with the logical expansion of urban demand. In the Initial state, waste sources are densely clustered in the residential core. During Period 1, the model strategically activates two primary facilities in central locations to maximize immediate coverage. As demand grows in subsequent periods, the network dynamically expands outward, activating new facilities in peripheral zones to alleviate capacity bottlenecks. The dense web of allocation lines demonstrates efficient load balancing across the network. This trajectory confirms that LAtt-PR can autonomously generate a hierarchical, cost-effective infrastructure layout that adapts to the complex, irregular topology of real-world metropolitan environments.
4.3. Ablation Experiments for Critical Components
After establishing the overall performance advantages of the LAtt-PR framework, this paper designs two sets of controlled experiments to further dissect the internal sources of its superior performance and verify the necessity of each core module’s design, specifically exploring the contributions of the hybrid optimization strategy and the spatiotemporal neural network components. First, focusing on the synergistic gains of the hybrid search strategy, this study aims to quantify the local refinement contribution of the PSO module within the framework by constructing an ablation variant, LAtt-PR w/o PSO, which eliminates the back-end population search process and directly employs the probability distribution output by the DRL policy network for sampling decisions.
To provide a visual demonstration of the universal performance gains achieved through hybrid search across varying problem scales, Figure 10 illustrates the cost convergence trajectories of the full LAtt-PR model compared to the variant excluding PSO (denoted w/o PSO) across the four scenarios S1 to S4. It is clearly observable that the hybrid strategy outperforms the ablation variant in both convergence velocity and solution quality. Specifically, LAtt-PR (red line) exhibits superior convergence efficiency across all tested scales. Compared to the w/o PSO variant (blue line), which relies solely on policy gradients for exploration, the integration of PSO enables the algorithm to achieve a steeper descent during the initial training stages and ultimately reach a lower steady-state cost. More crucially, as the problem scale expands from S1 in Figure 10a to S4 in Figure 10d, the performance gap between the two models widens in a markedly non-linear fashion.
Within the small-scale scenario S1, the solution space remains relatively compact, which permits pure DRL to approximate the optimal solution; consequently, the marginal utility of PSO refinement is constrained, and the performance curves remain closely aligned. Conversely, in the ultra-large-scale scenario S4, characterized by an exponentially expanding solution space, pure DRL frequently fails to locate global extrema and stagnates at elevated cost levels. In this context, the swarm-intelligence search of PSO becomes decisive: by executing high-density local optimization within the promising neighborhoods identified by DRL, it significantly reduces the final objective cost. This phenomenon underscores the necessity of the synergistic coarse-grained guidance and fine-grained polishing framework: DRL facilitates rapid pruning and yields high-quality initial search points, which prevents PSO from becoming trapped in local optima within high-dimensional spaces; in turn, PSO compensates for the inherent precision limitations of DRL during the final refinement stage, ensuring that the architecture maintains a competitive advantage even in large-scale, complex environments.
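The coarse-guidance and fine-polishing loop can be sketched as follows: the DRL policy's facility-opening scores seed a small PSO swarm, which then refines locally. Everything here (the toy cost function, swarm size, and PSO coefficients) is an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def refine_with_pso(policy_scores, cost_fn, n_particles=20, iters=40,
                    w=0.7, c1=1.5, c2=1.5):
    """Seed a PSO swarm around DRL opening scores and refine locally;
    a site is opened when its score exceeds 0.5."""
    dim = len(policy_scores)
    x = policy_scores + rng.normal(0.0, 0.1, size=(n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pcost = np.array([cost_fn(p > 0.5) for p in x])
    g = pbest[pcost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, 0.0, 1.0)
        c = np.array([cost_fn(p > 0.5) for p in x])
        better = c < pcost
        pbest[better], pcost[better] = x[better], c[better]
        g = pbest[pcost.argmin()].copy()
    return g > 0.5

# Toy cost: fixed cost per open site plus a shortfall penalty.
cost = lambda open_mask: 3.0 * int(open_mask.sum()) + 5.0 * max(
    10.0 - 4.0 * int(open_mask.sum()), 0.0)
plan = refine_with_pso(np.array([0.6, 0.4, 0.5, 0.2]), cost)
```

The DRL skeleton prunes the search to a promising neighborhood and PSO polishes it; with the toy cost above, opening three of the four sites minimizes the objective.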
To verify the structural robustness and functional necessity of the core neural components within the LAtt-PR encoder-decoder architecture, specifically the efficacy of the parallel fusion mechanism integrating FFN and GAT, this section establishes a series of ablation experiments. These investigations encompass the full baseline model and six ablation variants, as detailed in Table 11. The experiments were executed at a standardized problem scale under uniform training hyperparameters. By evaluating solution quality and computational efficiency across these network architectures, this study rigorously quantifies the marginal contribution of each component to the overall performance of the framework.
Based on the ablation results presented in Figure 11, the full model achieves the optimal TEC of 22.39, significantly outperforming all variants, which validates the structural rationality of the LAtt-PR architecture. First, among the variants targeting the parallel feature-extraction mechanism, the variant that removes the GAT branch suffers the most severe degradation, with costs surging to 25.36, a 13.25% gap. This indicates that the absence of spatial topological information causes the model to lose its perception of proximity effects, leading to geographically irrational facility location schemes. Meanwhile, the variant that removes the FFN branch declines by 6.02%, revealing the over-smoothing risk of relying on graph aggregation alone and underscoring the importance of preserving intrinsic node features. The serial (rather than parallel) structure, with a 3.08% gap, performs better than the single-branch variants but still falls short of the full model, further confirming the advantage of the parallel fusion mechanism in preserving the purity of multi-source features.
The ablation of the temporal modules highlights the critical role of long-range memory. Replacing the LSTM with a simple Multi-Layer Perceptron raises the cost to 24.91, a rise of 11.24%, suggesting that an MLP cannot effectively capture the dynamic evolution of WEEE generation. The variant that completely ignores future information exhibits the worst performance, with a gap of 18.49%; its extremely short solving time of 0.45 s comes at the expense of long-term planning capability, confirming that temporal modeling is decisive in preventing blind initial construction and subsequent capacity shortages. Finally, removing the MHA module causes a 4.26% performance setback; this implies that while GAT effectively processes local neighborhood information, MHA remains irreplaceable for capturing long-distance cross-region dependencies and ensuring global logistics coordination. In summary, the core components of LAtt-PR are not merely stacked together; through organic synergy, they collectively guarantee the model's robustness and superiority in complex dynamic environments.