4.1. Experimental Setup
To ensure that the evaluation results are broadly representative and practically meaningful, the experiment first constructs simulated urban scenarios encompassing diverse dimensional features. Regarding problem scale, four test gradients (S1–S4) with strictly increasing difficulty are designed; the combinations of demand point quantities and planning cycles are configured as shown in Table 3. These scale gradients are explicitly defined to mirror real-world administrative hierarchies: S1 represents community-level micro-planning, S2 corresponds to district-level scheduling, while S3 and S4 simulate complex city-wide and metropolitan logistics networks, respectively. These node configurations are consistent with standard benchmarks in the recent large-scale WEEE logistics literature [25,36], ensuring the evaluation covers the full spectrum of spatial complexity, from community-level short-term micro-planning to city-level long-term macro-scheduling, and rigorously tests the stability of the model under varying spatiotemporal complexity.
In terms of spatial distribution characteristics and network topology construction, a hierarchical design strategy is adopted to simulate the clustering features of population and commercial activities in real cities. First, to model the distribution of demand points, the experiment employs a GMM to construct 3–5 random hotspot areas or a pronounced polycentric distribution within a unit square. This approach simulates the population density variations and spatial heterogeneity ranging from core commercial districts to suburban areas. Furthermore, to reflect the rational characteristics of logistics facility planning, the generation of the candidate site set I abandons simple random point distribution in favor of a clustering-based strategy. Specifically, for the generated non-uniform demand distribution, the K-Means algorithm is applied to extract cluster centroids as the base coordinates for candidate sites. This strategy ensures that candidate facilities are inherently located at the geometric centers of high-density demand areas, representing optimal logistics nodes with the advantage of minimizing potential transportation costs, thereby constructing a spatial topology with high practical relevance.
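The two-stage spatial generation described above (hotspot-based demand sampling followed by K-Means extraction of candidate sites) can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the hotspot count, spread, and a hand-rolled Lloyd's iteration stand in for the unspecified GMM and K-Means configurations.

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_demand_points(n_points=200, n_hotspots=4, spread=0.06):
    """Sample demand points from a mixture of Gaussian hotspots inside
    the unit square, mimicking a polycentric urban demand distribution."""
    centers = rng.uniform(0.15, 0.85, size=(n_hotspots, 2))
    labels = rng.integers(0, n_hotspots, size=n_points)
    pts = centers[labels] + rng.normal(0.0, spread, size=(n_points, 2))
    return np.clip(pts, 0.0, 1.0)

def kmeans_centroids(points, k=15, iters=50):
    """Plain Lloyd's algorithm; the k centroids serve as base coordinates
    for candidate facilities at the centres of high-density demand areas."""
    cent = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None, :] - cent[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                cent[j] = members.mean(axis=0)
    return cent

demand = generate_demand_points()
candidates = kmeans_centroids(demand)
```

In a library setting, `sklearn.mixture.GaussianMixture` and `sklearn.cluster.KMeans` would replace the hand-rolled pieces; the point of the strategy is that candidate sites land at the geometric centers of dense demand clusters.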
To accurately simulate the time-varying and non-stationary nature of WEEE generation in real-world scenarios, this study forgoes traditional static or purely random demand assumptions. Instead, a dynamic demand generation mechanism is developed based on the decomposition of trend, seasonal, and stochastic components. Specifically, the WEEE generation volume $d_{jt}$ at each demand point $j$ during period $t$ is derived from the following mixed process:

$$d_{jt} = \bar{d}_j \left( 1 + \alpha \sin\frac{2\pi t}{T} + \beta t \right) + \epsilon_{jt}, \qquad \epsilon_{jt} \sim \mathcal{N}(0, \sigma^2).$$

In this formulation, $\bar{d}_j$ denotes the baseline demand level of the node. The dynamic nature of demand is captured through several constituent components: the sinusoidal term, with $\alpha$ modulating the seasonal intensity, simulates the periodic oscillation of electronic product disposal, reflecting cyclical fluctuations driven by replacement cycles or market activities; the linear term incorporates a growth rate $\beta$ to model the long-term upward trajectory of e-waste volume; and $\epsilon_{jt}$ is Gaussian random noise that accounts for unpredictable daily fluctuations.
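A minimal generator for this trend/seasonality/noise process might look like the following; the coefficient values (seasonal amplitude, growth rate, noise scale) are illustrative assumptions, and negative draws are truncated at zero since waste volumes cannot be negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def demand_series(base, T, alpha=0.3, beta=0.01, sigma=0.05):
    """d_t = base * (1 + alpha*sin(2*pi*t/T) + beta*t) + N(0, sigma^2):
    baseline level, seasonal oscillation, linear growth, and noise."""
    t = np.arange(T)
    d = base * (1.0 + alpha * np.sin(2.0 * np.pi * t / T) + beta * t)
    d += rng.normal(0.0, sigma, size=T)
    return np.maximum(d, 0.0)  # waste volumes cannot go negative

d = demand_series(base=10.0, T=12)
```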
This design not only aligns the testing environment with the fluctuation characteristics of real-world supply chains but, more pivotally, it also constructs a data environment with significant time-series dependency features. This allows for the effective verification of the LSTM module’s capacity within the framework to capture long-term dependencies and non-linear spatiotemporal patterns, determining whether the model has truly acquired the adaptive intelligence required for dynamic demand-based allocation.
To evaluate the total operational expenditures over multiple periods and assess the economic feasibility of the facility location configuration, the experiment establishes a set of benchmark operational parameters. These values are calibrated with reference to the extant literature on WEEE reverse logistics [
37] and empirical survey data. Specifically, the site construction cost
is modeled as a fixed cost to reflect the initial investment in land and infrastructure [
38], while the unit transportation cost
is defined as a linear function of the Euclidean distance. In particular, to evaluate algorithmic performance within resource-constrained environments, the facility capacity
is modeled as a dynamic constraint correlated with the regional average demand [
36]. This configuration necessitates a strategic trade-off between load distribution across multiple facilities and centralized processing at fewer facilities [
39]. The comprehensive baseline parameter settings are summarized in
Table 4.
The operational parameters in
Table 4 were calibrated based on standard values in established WEEE literature [
37,
38] to ensure economic realism and comparability. For the deep learning hyperparameters in
Table 5, initial values were determined via a preliminary grid search within ranges recommended by recent studies. The optimality of these selected key parameters is further rigorously verified through the sensitivity analysis presented in
Section 4.2.
The training and testing of the deep neural networks were conducted on a high-performance computing platform equipped with an NVIDIA GeForce RTX 4090 GPU (24 GB) and CUDA 12.8. The detailed hyperparameter configurations for model training are presented in
Table 5.
This section selects three categories of baseline methods for comparative analysis to establish a rigorous and objective performance evaluation framework. First, exact solvers such as Gurobi and OR-Tools directly resolve the MILP model through the branch-and-bound method; notwithstanding their significant computational requirements, they provide ground-truth optimal solutions against which the optimality gap is measured. Second, metaheuristic algorithms including GA and PSO are incorporated as representative mainstream benchmarks in the engineering domain for addressing such NP-hard problems. Finally, pure reinforcement learning baselines such as DQN and PPO are utilized to represent end-to-end learning paradigms that do not incorporate hybrid search mechanisms, thereby verifying the necessity of the hybrid optimization strategy proposed in this study.
To quantify model performance across multiple dimensions and address the core scientific questions posed earlier, this section establishes a comprehensive evaluation framework encompassing economic feasibility, operational efficiency, and computational characteristics. Within the fundamental economic dimension, Total Expected Cost (TEC) is utilized as the primary metric for evaluating the viability of the proposed schemes. This is supplemented by a granular breakdown and statistical analysis of construction, operation, transportation, and penalty costs to rigorously examine the rationality of the cost structure.
To further investigate the model's trade-off mechanism across multi-dimensional objectives, this study examines the coupling between service levels and resource-allocation efficiency. Specifically, the Average Service Rate (ASR) is introduced as a core indicator quantifying the social benefits of the recycling network. It is defined as the proportion of actually collected volume relative to the total dynamic demand within the planning horizon:

$$\text{ASR} = 1 - \frac{\sum_{t=1}^{T}\sum_{j} u_{jt}}{\sum_{t=1}^{T}\sum_{j} d_{jt}},$$

where $u_{jt}$ represents the unmet demand determined after decision-making; the metric intuitively reflects the system's responsiveness to time-varying demand and its service breadth. Complementary to this is Capacity Utilization (CU), which monitors the actual load levels of active stations. This allows the model to be scrutinized for whether it has truly acquired "on-demand allocation" dynamic adjustment strategies, effectively avoiding resource waste caused by idle facilities and service bottlenecks caused by overloads. Finally, given the stringent real-time requirements of large-scale dynamic scheduling, Average Decision Time is used as the key criterion for computational efficiency; combined with the cumulative reward convergence trajectory during DRL training, it verifies the engineering suitability of the LAtt-PR framework from the perspectives of timeliness and learning stability.
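The ASR and CU indicators can be computed directly from the decision outputs; the array shapes below (periods by demand points, per-facility totals) are an assumed layout for illustration.

```python
import numpy as np

def average_service_rate(demand, unmet):
    """ASR = 1 - total unmet demand / total dynamic demand,
    with demand and unmet given as (T, J) period-by-node arrays."""
    return 1.0 - unmet.sum() / demand.sum()

def capacity_utilization(collected, capacity, active):
    """Mean load ratio over active stations only; idle (inactive)
    facilities are excluded so CU reflects the working network."""
    return float((collected[active] / capacity[active]).mean())

demand = np.array([[10.0, 12.0], [11.0, 13.0]])
unmet = np.array([[0.0, 1.0], [0.0, 0.0]])
asr = average_service_rate(demand, unmet)  # 1 - 1/46

cu = capacity_utilization(
    collected=np.array([8.0, 0.0, 9.0]),
    capacity=np.array([10.0, 10.0, 12.0]),
    active=np.array([True, False, True]),
)  # mean of 0.8 and 0.75 = 0.775
```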
4.2. Comparative Analysis of Overall Performance
Based on the experimental setup described above,
Table 6 details the comprehensive performance of each algorithm across various problem scales, encompassing three key metrics: objective function value (Obj.), relative optimality gap (Gap), and computational time (Time).
It can be clearly observed that as the problem scale expands from S1 to S4, the performance of the algorithms diverges markedly. While exact solvers such as Gurobi consistently identify the global optimal solution, their computational cost escalates exponentially with scale, reaching 10.61 s at S4; this surge in time complexity renders them unsuitable for larger instances or real-time dynamic scheduling. In contrast, traditional metaheuristics including GA and PSO encounter severe performance bottlenecks on large-scale, complex facility location problems: at S4, their optimality gaps expand sharply to over 16%, indicating that relying solely on random search in high-dimensional discrete solution spaces easily leads to local optima and hinders convergence to high-quality solutions. Pure reinforcement learning methods such as DQN and PPO, although achieving very fast solving speeds of less than 0.7 s through neural network inference, remain limited in precision due to the absence of fine-grained local search, maintaining a gap of approximately 9% in large-scale scenarios. Notably, the LAtt-PR framework demonstrates superior robustness across all tested scales. In particular, in the most challenging large-scale scenario S4, LAtt-PR maintains the optimality gap within 3.98%, an improvement of approximately 76% over GA and PSO and 55% over pure RL. Furthermore, its execution time is only 1.71 s, roughly 16% of the time required by Gurobi. These results demonstrate that LAtt-PR, by organically integrating the fast inference of DRL with the local refinement of PSO, strikes an effective balance between solving efficiency and solution quality.
From the perspective of computational sustainability, the LAtt-PR framework offers a distinct advantage over traditional exact solvers. While the offline training of the DRL agent incurs a fixed computational overhead, this is a one-time investment. In contrast, exact solvers like Gurobi rely on Branch-and-Bound algorithms with exponential time complexity, leading to prohibitive energy consumption as the network scale expands.
LAtt-PR adopts a “Train-Once-Deploy-Everywhere” paradigm. Once trained, the model executes inference in polynomial time, enabling real-time decision-making with minimal energy expenditure. For large-scale instances, LAtt-PR achieves a 6× speedup compared to Gurobi, significantly reducing the computational carbon footprint required for routine dynamic scheduling. This characteristic ensures the scalability of the system for metropolitan-level applications, aligning with the principles of sustainable computing by delivering high-quality solutions without the excessive resource consumption typical of combinatorial optimization.
The statistical significance of the observed performance disparities was rigorously evaluated using paired t-tests. As evidenced in
Table 7, the calculated p-values across all baseline comparisons consistently fall below the 0.01 threshold. These results provide sufficient statistical evidence to reject the null hypothesis, confirming that the superiority of LAtt-PR is significant at the 99% confidence level and not attributable to stochastic variance.
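As a sketch of the significance test, a paired t-statistic can be computed on per-instance cost differences (the same instances are solved by both methods, so pairing is valid). The sample values below are hypothetical, not the paper's data; in practice `scipy.stats.ttest_rel` returns the statistic and p-value in one call.

```python
import numpy as np
from math import sqrt

def paired_t(costs_a, costs_b):
    """Paired t-statistic and degrees of freedom for per-instance
    cost differences between two algorithms."""
    d = np.asarray(costs_a, dtype=float) - np.asarray(costs_b, dtype=float)
    n = len(d)
    t = d.mean() / (d.std(ddof=1) / sqrt(n))
    return t, n - 1

# Hypothetical per-instance TEC samples, for illustration only.
latt_pr = [22.4, 22.6, 22.3, 22.5, 22.4]
ga = [24.9, 25.3, 24.7, 25.1, 25.0]
t_stat, dof = paired_t(latt_pr, ga)  # strongly negative: LAtt-PR is cheaper
```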
To further demonstrate the advantages of the LAtt-PR framework in optimizing the structure of the solution space, this section systematically evaluates the performance of the algorithms in the large-scale scenarios (S3 and S4). Leveraging the statistical results in Table 8 and the visualizations in Figure 6, we comprehensively assess the operational efficiency indicators (ASR, CU) and the refined cost structure.
In the most challenging ultra-large-scale scenario S4, LAtt-PR demonstrates superior operational efficiency; as illustrated in Figure 6, its ASR curve remains consistently near 100%, mirroring the performance of Gurobi, while GA and PSO decline to approximately 94%. This performance disparity is explained by the cost composition: GA and PSO incur high penalty costs exceeding 6.5% because they myopically minimize initial construction costs (approximately 15.1%), which leads to severe supply shortages. Conversely, LAtt-PR maintains a balanced construction investment of 16.6% that effectively eliminates penalties, reducing them to only 0.3%, thereby achieving an optimal equilibrium between cost-effectiveness and service quality. This contrast indicates that while traditional metaheuristics are prone to becoming trapped in low-construction-cost local optima, LAtt-PR successfully bypasses such short-sighted strategies within high-dimensional discrete solution spaces.
The advantage of LAtt-PR is fundamentally attributed to its unique spatio-temporal attention-prediction mechanism. On the one hand, the integrated MHA effectively extracts spatial dependencies between demand points to guide facility locations toward high-density areas; on the other hand, the DRL Critic network successfully maps current construction investments to long-term service returns. This endows the agent with proactive decision-making capabilities, allowing it to assume necessary immediate costs to preemptively avoid high future penalties. By combining this forward-looking optimization with PSO’s local refinement, LAtt-PR identifies the global service-cost balance in dynamic environments rather than merely pursuing localized cost minimization.
To further validate the learning stability and convergence efficiency of the algorithm, Figure 7 illustrates the TEC trajectories of LAtt-PR and the pure RL baselines over the course of training.
LAtt-PR exhibits markedly superior convergence characteristics throughout training. Regarding convergence speed, its cost curve enters a plateau after approximately 50 training epochs, whereas PPO and DQN require roughly 100 and 125 epochs, respectively, to reach a comparable steady state; this indicates that the fine-grained local search of PSO provides high-quality, low-variance gradient signals for policy updates, effectively accelerating exploration of the policy space. In terms of stability, the cost curve of LAtt-PR shows significantly smaller fluctuations and follows a smooth downward trend, while the PPO and DQN curves display multiple pronounced oscillations and performance regressions during the mid-training phase, reflecting the policy degradation and high variance inherent in pure reinforcement-learning exploration of high-dimensional discrete action spaces.
To rigorously evaluate the model's resilience against unforeseen demand fluctuations, a sensitivity analysis was conducted by varying the intensity of the Gaussian noise term in the demand process. The results, summarized in Table 9, reveal distinct performance trajectories under escalating uncertainty levels. While all algorithms exhibit increased costs under higher volatility, LAtt-PR demonstrates superior stability, limiting the cost increment to 20.39% even under high-noise conditions. In contrast, the PPO and GA baselines are significantly more sensitive, with performance degradation rates of 30.43% and 36.16%, respectively. This disparity highlights that while metaheuristics struggle to adapt to stochastic perturbations, the proposed hybrid framework effectively leverages its LSTM module to filter high-frequency noise, ensuring robust long-term decision-making.
To validate the rationale behind the selected parameter configurations, a sensitivity analysis was conducted on three pivotal hyperparameters: the penalty coefficient for unserved waste, the learning rate of the DRL agent, and the inertia-weight strategy of the PSO module. As presented in Table 10, the selected penalty coefficient achieves the optimal equilibrium, maintaining an ASR of 98.9% while minimizing the TEC. A lower penalty fails to sufficiently penalize unmet demand, dropping the ASR to 92.1%, whereas a higher penalty forces excessive infrastructure construction, inflating the TEC by 13.8%. Regarding the learning rate, the selected value demonstrates the most stable convergence; larger rates lead to oscillation, while smaller rates suffer from slow convergence. Finally, the adaptive linear decay strategy for the PSO inertia weight yields a 2.3% cost reduction compared to a fixed-weight strategy, confirming the benefit of dynamically balancing exploration and exploitation.
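The linear inertia-weight decay for PSO is standard; a minimal version is shown below, with the common defaults w_max = 0.9 and w_min = 0.4 assumed since the exact bounds are not stated here.

```python
def inertia(epoch, total_epochs, w_max=0.9, w_min=0.4):
    """Linearly decay the PSO inertia weight from w_max to w_min,
    shifting the swarm from global exploration to local exploitation."""
    return w_max - (w_max - w_min) * epoch / max(total_epochs - 1, 1)

ws = [inertia(e, 100) for e in range(100)]  # 0.9 at the start, 0.4 at the end
```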
The aforementioned convergence analysis further validates the effectiveness of the LAtt-PR hybrid architecture: the DRL module is responsible for learning high-level state representations and policy skeletons, while the PSO module acts as an online policy refiner that continuously provides improved action samples through local search, thereby guiding policy updates toward superior directions. This synergistic mechanism not only enhances the quality of single-period decisions but also significantly mitigates common reinforcement learning issues in combinatorial optimization—such as training instability and slow convergence—by reducing the variance of policy gradient estimates, providing a reliable learning framework for efficient and robust dynamic facility location. To further provide a visual demonstration of the LAtt-PR performance, this paper visualizes the solutions generated under the S4 scale in
Figure 8; as the iterative cycles progress, the algorithm incrementally increases the number of sites while significantly reducing the average service distance, achieving a favorable balance between coverage and efficiency.
To further assess the practical applicability of the proposed framework in realistic urban environments, a case study was conducted based on the real-world topology of Zhongguancun, Beijing. By extracting geospatial data for 50 residential communities and 15 industrial zones via OpenStreetMap, we constructed a faithful representation of a metropolitan logistics network. The multi-period evolution of the optimized layout is visualized in
Figure 9.
As visualized in the figure, the spatiotemporal evolution of the facility network aligns with the logical expansion of urban demand. In the Initial state, waste sources are densely clustered in the residential core. During Period 1, the model strategically activates two primary facilities in central locations to maximize immediate coverage. As demand grows in subsequent periods, the network dynamically expands outward, activating new facilities in peripheral zones to alleviate capacity bottlenecks. The dense web of allocation lines demonstrates efficient load balancing across the network. This trajectory confirms that LAtt-PR can autonomously generate a hierarchical, cost-effective infrastructure layout that adapts to the complex, irregular topology of real-world metropolitan environments.
4.3. Ablation Experiments for Critical Components
After establishing the overall performance advantages of the LAtt-PR framework, this paper designs two sets of controlled experiments to further dissect the internal sources of its superior performance and verify the necessity of each core module’s design, specifically exploring the contributions of the hybrid optimization strategy and the spatiotemporal neural network components. First, focusing on the synergistic gains of the hybrid search strategy, this study aims to quantify the local refinement contribution of the PSO module within the framework by constructing an ablation variant, LAtt-PR w/o PSO, which eliminates the back-end population search process and directly employs the probability distribution output by the DRL policy network for sampling decisions.
To provide a visual demonstration of the universal performance gains achieved through hybrid search across varying problem scales, Figure 10 illustrates the cost convergence trajectories of the full LAtt-PR model compared to the variant excluding PSO (denoted w/o PSO) across the four scenarios S1 to S4. It is clearly observable that the hybrid strategy outperforms the ablation variant in both convergence velocity and solution quality. Specifically, LAtt-PR (red line) exhibits superior convergence efficiency across all tested scales. Compared to the w/o PSO variant (blue line), which relies solely on policy gradients for exploration, the integration of PSO enables the algorithm to achieve a steeper descent during the initial training stages and ultimately reach a lower steady-state cost. More crucially, as the problem scale expands from S1 in Figure 10a to S4 in Figure 10d, the performance gap between the two models widens in a markedly non-linear fashion.
Within the small-scale scenario S1, the solution space remains relatively compact, which permits pure DRL to approximate the optimal solution; consequently, the marginal utility of PSO refinement is constrained, and the performance curves remain closely aligned. Conversely, in the ultra-large-scale scenario S4, characterized by an exponentially expanding solution space, pure DRL frequently fails to locate global extrema and stagnates at elevated cost levels. In this context, the swarm-intelligence search of PSO becomes decisive: by executing high-density local optimization within the promising neighborhoods identified by DRL, it significantly reduces the final objective cost. This phenomenon underscores the necessity of the synergistic coarse-grained guidance and fine-grained polishing framework: DRL facilitates rapid pruning and yields high-quality initial search points, which prevents PSO from becoming trapped in local optima within high-dimensional spaces; in turn, PSO compensates for the inherent precision limitations of DRL during the final refinement stage, ensuring that the architecture maintains a competitive advantage even in large-scale, complex environments.
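The coarse-guidance and fine-polishing loop can be sketched as follows: the DRL policy's facility-opening scores seed a small PSO swarm, which then refines locally. Everything here (the toy cost function, swarm size, and PSO coefficients) is an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def refine_with_pso(policy_scores, cost_fn, n_particles=20, iters=40,
                    w=0.7, c1=1.5, c2=1.5):
    """Seed a PSO swarm around DRL opening scores and refine locally;
    a site is opened when its score exceeds 0.5."""
    dim = len(policy_scores)
    x = policy_scores + rng.normal(0.0, 0.1, size=(n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pcost = np.array([cost_fn(p > 0.5) for p in x])
    g = pbest[pcost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, 0.0, 1.0)
        c = np.array([cost_fn(p > 0.5) for p in x])
        better = c < pcost
        pbest[better], pcost[better] = x[better], c[better]
        g = pbest[pcost.argmin()].copy()
    return g > 0.5

# Toy cost: fixed cost per open site plus a shortfall penalty.
cost = lambda open_mask: 3.0 * int(open_mask.sum()) + 5.0 * max(
    10.0 - 4.0 * int(open_mask.sum()), 0.0)
plan = refine_with_pso(np.array([0.6, 0.4, 0.5, 0.2]), cost)
```

The DRL skeleton prunes the search to a promising neighborhood and PSO polishes it; with the toy cost above, opening three of the four sites minimizes the objective.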
To verify the structural robustness and functional necessity of the core neural components within the LAtt-PR encoder-decoder architecture, specifically the efficacy of the parallel fusion mechanism integrating FFN and GAT, this section establishes a series of ablation experiments. These investigations encompass the full baseline model and six ablation variants, as detailed in Table 11. The experiments were executed at a standardized problem scale under uniform training hyperparameters. By evaluating solution quality and computational efficiency across these network architectures, this study rigorously quantifies the marginal contribution of each component to the overall performance of the framework.
Based on the ablation results presented in Figure 11, the full model achieves the optimal TEC of 22.39, significantly outperforming all variants, which validates the structural rationality of the LAtt-PR architecture. First, among the variants targeting the parallel feature-extraction mechanism, the variant that removes the GAT branch suffers the most severe degradation, with costs surging to 25.36, a 13.25% gap. This indicates that the absence of spatial topological information causes the model to lose its perception of proximity effects, leading to geographically irrational facility location schemes. Meanwhile, the variant that removes the FFN branch declines by 6.02%, revealing the over-smoothing risk of relying on graph aggregation alone and underscoring the importance of preserving intrinsic node features. The serial (rather than parallel) structure, with a 3.08% gap, performs better than the single-branch variants but still falls short of the full model, further confirming the advantage of the parallel fusion mechanism in preserving the purity of multi-source features.
The ablation of the temporal modules highlights the critical role of long-range memory. Replacing the LSTM with a simple Multi-Layer Perceptron raises the cost to 24.91, a rise of 11.24%, suggesting that an MLP cannot effectively capture the dynamic evolution of WEEE generation. The variant that completely ignores future information exhibits the worst performance, with a gap of 18.49%; its extremely short solving time of 0.45 s comes at the expense of long-term planning capability, confirming that temporal modeling is decisive in preventing blind initial construction and subsequent capacity shortages. Finally, removing the MHA module causes a 4.26% performance setback; this implies that while GAT effectively processes local neighborhood information, MHA remains irreplaceable for capturing long-distance cross-region dependencies and ensuring global logistics coordination. In summary, the core components of LAtt-PR are not merely stacked together; through organic synergy, they collectively guarantee the model's robustness and superiority in complex dynamic environments.