Article

Optimization of Monitoring Node Layout in Desert–Gobi–Wasteland Regions Based on Deep Reinforcement Learning

1 Gansu Electric Power Company of State Grid, Lanzhou 730046, China
2 Zhangye Power Supply Company of Gansu Electric Power Company of State Grid, Zhangye 734000, China
3 Artificial Intelligence and Big Data Research Institute, Lanzhou Dafang Electronics Co., Ltd., Lanzhou 730070, China
4 School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(2), 237; https://doi.org/10.3390/sym18020237
Submission received: 15 December 2025 / Revised: 22 January 2026 / Accepted: 26 January 2026 / Published: 29 January 2026
(This article belongs to the Section Computer)

Abstract

Desert–Gobi–wasteland regions possess abundant wind resources and are strategic areas for future renewable energy development and meteorological monitoring. However, existing studies have limited capability in addressing the highly complex and dynamic environmental characteristics of these regions. In particular, few modeling approaches can jointly represent terrain variability, solar radiation distribution, and wind-field characteristics within a unified framework. Moreover, conventional deep reinforcement learning methods often suffer from learning instability and coordination difficulties when applied to multi-agent layout optimization tasks. To address these challenges, this study constructs a multidimensional environmental simulation model that integrates terrain, solar radiation, and wind speed, enabling a quantitative and controllable representation of the meteorological monitoring network layout problem. Based on this environment, an Environment-Aware Proximal Policy Optimization (EA-PPO) algorithm is proposed. EA-PPO adopts a compact environment-related state representation and a utility-guided reward mechanism to improve learning stability under decentralized decision-making. Furthermore, a Global Layout Optimization Algorithm based on EA-PPO (GLOAE) is developed to enable coordinated optimization among multiple monitoring nodes through shared utility feedback. Simulation results demonstrate that the proposed methods achieve superior layout quality and convergence performance compared with conventional approaches, while exhibiting enhanced robustness under dynamic environmental conditions. These results indicate that the proposed framework provides a practical and effective solution for intelligent layout optimization of meteorological monitoring networks in desert–Gobi–wasteland regions.

1. Introduction

Desert–Gobi–wasteland regions possess abundant and relatively stable wind resources and represent strategic areas for China’s future renewable energy development and meteorological prediction [1]. Due to their unique geographical conditions and fragile ecological environments, the scientific deployment of meteorological monitoring networks in these regions is critically important for renewable resource utilization, climate trend forecasting, and ecological environment management [2]. However, the complex terrain, highly variable climatic conditions, and frequent aeolian activities in desert–Gobi–wasteland regions pose significant challenges to monitoring node layout optimization. These challenges include strong coupling among multiple environmental factors, pronounced spatial heterogeneity, and persistent dynamic disturbances [3]. As a result, designing an efficient and stable meteorological monitoring system under such extreme environmental conditions remains a major scientific and engineering challenge in the renewable energy domain. Therefore, intelligent optimization of monitoring network layout in desert–Gobi–wasteland regions is of both scientific and practical significance.
Existing layout optimization methods for meteorological monitoring networks mainly include geometry-based uniform deployment strategies [4] and heuristic local search algorithms [5]. While these approaches can achieve acceptable performance in small- to medium-scale scenarios, they often struggle in large-scale environments characterized by strong spatial heterogeneity and nonlinear wind-field structures, such as desert–Gobi–wasteland regions. Although heuristic algorithms can partially improve search efficiency, they are typically sensitive to parameter settings, prone to unstable convergence, and insufficient for handling dynamic environmental variations [6].
In recent years, deep reinforcement learning (DRL) has shown considerable potential for layout optimization and resource allocation problems due to its adaptive and self-learning capabilities [7]. Nevertheless, conventional DRL methods still suffer from the curse of dimensionality and unstable convergence when applied to high-dimensional continuous state spaces [8], which limits their effectiveness in layout optimization tasks involving complex terrain–environment interactions.
To address these limitations, this study focuses on desert–Gobi–wasteland regions and constructs a multidimensional simulation environment that integrates key environmental factors, including terrain, solar radiation, and wind speed [9]. This environment enables a realistic and quantitative representation of layout constraints and environmental attributes. To overcome the efficiency and stability bottlenecks of traditional DRL in high-dimensional layout problems, we propose an Environment-Aware Proximal Policy Optimization (EA-PPO) algorithm. Built upon the PPO framework, EA-PPO incorporates environment-aware state representations and an adaptive reward mechanism, allowing agents to make decentralized decisions under terrain cost and wind energy potential constraints, thereby improving learning stability and optimization performance. Furthermore, based on EA-PPO, we develop a Global Layout Optimization Algorithm (GLOAE), which employs a cooperative multi-agent learning strategy to achieve layout optimization oriented toward global utility.
The main contributions of this paper are summarized as follows:
(1)
Multidimensional environmental modeling. A multidimensional simulation environment model is constructed for desert–Gobi–wasteland regions by integrating terrain, solar radiation, and wind-speed information. This model enables quantitative description and simulation of meteorological monitoring network layout problems.
(2)
Environment-aware reinforcement learning algorithms. To address the dimensionality and convergence challenges of traditional DRL in high-dimensional layout tasks, the EA-PPO algorithm is proposed, significantly enhancing policy learning efficiency and stability. Building upon EA-PPO, the GLOAE is further developed to achieve adaptive node deployment through a collaborative, global-benefit-oriented multi-agent learning mechanism.
(3)
Comprehensive experimental validation. Extensive simulation experiments demonstrate that EA-PPO and GLOAE outperform conventional methods in terms of layout quality, convergence speed, global utility, and adaptability to dynamic environmental changes. The results indicate that the proposed methods exhibit strong practicality and generalization potential in complex desert–Gobi–wasteland environments.

2. Related Work

Existing layout optimization approaches can be broadly categorized into three groups: mathematical optimization methods, intelligent heuristic optimization methods, and deep reinforcement learning–driven intelligent layout methods. This section provides a systematic overview of the research progress in each category.

2.1. Mathematical Optimization Methods

Mathematical optimization methods represent the earliest and most widely adopted approaches for layout optimization. Their core idea is to formulate layout problems as mathematically tractable optimization models with explicit objective functions, enabling analytical or semi-analytical solutions under given constraints. Owing to their clear modeling structure and relatively low computational cost, these methods have been extensively applied across various engineering fields.
Zhang et al. [10] proposed a high-dimensional vector encoding technique to formalize pipeline system layout optimization as a novel mathematical representation, introducing parameter-adaptive equations to facilitate high-dimensional solution search. Guo et al. [11] incorporated atmospheric stability and wake effects into wind farm layout optimization and established a mathematical optimization framework based on an improved Gaussian wake model, thereby significantly enhancing overall power generation efficiency. Zhang et al. [12] addressed the array layout of wave energy converters by proposing a chaotic differential evolution (CDE) algorithm, which dynamically adjusts population search strategies through a three-layer information interaction structure to achieve more stable global optimization performance. In addition, Zhang et al. [13] formulated a mixed-integer nonlinear programming (MINLP) model for logistics distribution node layout, effectively balancing transportation costs and spatial distribution uniformity.
Overall, mathematical optimization methods offer advantages such as simple structure, ease of implementation, and low computational burden, making them suitable for small-scale or idealized environments. However, these methods typically rely on explicit problem formulations and have limited capability in handling environmental uncertainty and high-dimensional dynamic characteristics. Moreover, they are prone to convergence toward local optima, which restricts their applicability in complex nonlinear systems, such as meteorological monitoring networks in desert–Gobi–wasteland regions.

2.2. Intelligent Optimization Methods

Intelligent optimization methods address complex layout optimization problems through heuristic or swarm intelligence mechanisms, exhibiting strong global search capability and robustness. By simulating biological behaviors, natural evolution, or group cooperation, these methods iteratively explore the solution space and are particularly effective for high-dimensional problems that are difficult to handle using traditional mathematical optimization techniques.
Xu et al. [14] investigated workshop layout optimization by proposing an improved particle swarm optimization (IPSO) algorithm that incorporates task collaboration and low-carbon logistics objectives, enabling coordinated optimization of task scheduling and spatial layout. Daqaq et al. [15] formulated wind farm layout optimization as a nonlinear optimization problem and enhanced the global search capability of the manta ray foraging optimization (MRFO) algorithm by introducing chaotic mapping, thereby achieving higher solution accuracy and faster convergence. Shan et al. [16] combined chaotic mechanisms with genetic algorithms (GA) and particle swarm optimization (PSO) to improve the ability of wind farm layout algorithms to escape local optima. Silva et al. [17] developed a modular optimization framework that integrates genetic algorithms with simulation tools for offshore wind farm layout, achieving a balanced trade-off between energy capture efficiency and spatial utilization.
Despite their effectiveness in addressing complex, nonlinear, and multi-objective layout problems, intelligent optimization algorithms still exhibit several limitations [18]. First, they are often highly sensitive to initial parameter settings and fitness function design, which makes them prone to convergence toward local optima. Second, their computational complexity increases significantly in dynamic environments or large-scale node configurations. Third, most intelligent optimization methods are primarily designed for offline optimization and exhibit limited real-time adaptability. These limitations restrict their applicability in high-dimensional and dynamically changing environments.

2.3. Deep Reinforcement Learning–Driven Intelligent Layout Methods

With rapid advancements in artificial intelligence, deep reinforcement learning (DRL) has been widely applied to self-learning and decision optimization problems in complex systems [19]. By combining the feature extraction capability of deep learning with the policy optimization mechanism of reinforcement learning, DRL enables end-to-end policy training in unsupervised or partially observable environments. This provides a new perspective and technical pathway for the layout optimization of meteorological monitoring networks.
Zhang et al. [20] proposed a congestion-aware reward shaping mechanism for circuit layout optimization, integrating exploration–exploitation balancing and a hierarchical policy structure into a DRL framework to improve convergence efficiency and global optimization performance. Zhou et al. [21] designed a multi-objective reward mechanism and hierarchical state representation to simultaneously address coverage optimization and energy balance constraints in complex scenarios, providing valuable insights for multi-agent layout optimization. Pushpa et al. [22] combined DRL with graph neural networks (GNNs) to leverage spatial dependencies in graph structures, thereby enhancing global perception and operational efficiency while maintaining adaptive decision-making capability. Chowdhuri et al. [23] formulated node layout optimization within a Lyapunov-based framework and proposed a hybrid DRL approach that performs hierarchical decision-making using intra-cluster and inter-cluster information, effectively reducing energy consumption and communication voids while improving convergence speed.
Overall, DRL-driven layout optimization methods demonstrate strong generalization ability, dynamic adaptability, and distributed decision-making performance in complex multi-constraint environments. However, several challenges remain, including insufficient training stability, state space dimensionality explosion, and difficulties in accurate environment modeling.

3. Simulation Environment Construction and Problem Modeling

Based on the above research background, a controlled simulation environment inspired by the typical terrain and climatic characteristics of desert–Gobi–wasteland regions is constructed to evaluate the proposed algorithms under heterogeneous environmental conditions. This section first introduces the modeling of the environmental map, including the simulation and integration of key factors such as terrain, solar radiation, and wind speed. It then describes the modeling assumptions and observation mechanisms of the monitoring nodes. Finally, the mathematical formulation of the layout optimization problem and its associated constraints are presented, providing the theoretical and data foundations for the subsequent experimental design and validation.

3.1. Simulation Modeling of the Desert–Gobi–Wasteland Regions

For simulation purposes, the study area is defined as a square region with a side length of 10 km. The region is discretized into a regular grid to enable controlled evaluation and fair algorithmic comparison. Following relevant studies [24], three environmental factors—terrain, solar radiation, and wind speed—are selected as key constraints influencing the deployment of monitoring nodes in desert–Gobi–wasteland regions, and corresponding simulation models are constructed. Specifically, the terrain cost map represents the construction and maintenance costs of monitoring nodes and reflects the impact of abstracted surface variability and terrain-related factors on deployment feasibility. The solar radiation map describes the spatial distribution of energy harvesting potential, indicating differences in the self-supply capability of monitoring nodes. The wind speed intensity map characterizes the relative spatial distribution of meteorological observation value, reflecting the contribution of wind-field variations to information acquisition.
By integrating these environmental models, spatial heterogeneity and energy distribution patterns relevant to desert–Gobi–wasteland deployment scenarios are captured within a controlled and reproducible simulation framework. This framework provides a repeatable foundation for subsequent algorithm design and performance evaluation. The modeling procedures for the terrain cost map, solar radiation map, and wind speed intensity map are described in detail below.

3.1.1. Terrain Cost Map

To simulate the continuous undulating terrain characteristics of desert–Gobi–wasteland regions in an engineering-oriented manner, Gaussian-smoothed noise is adopted to construct the terrain cost distribution. Specifically, a random noise field is first generated within the study region. This noise field is then convolved with a Gaussian kernel, and the resulting output is normalized to obtain a Gaussian-smoothed normalized map, from which the continuous terrain cost map is derived. The corresponding expressions are given in Equations (1) and (2):
B_\sigma(x, y) = \frac{G_\sigma * n(x, y) - \min(G_\sigma * n)}{\max(G_\sigma * n) - \min(G_\sigma * n)}, \quad (1)
C_\sigma(x, y) = B_\sigma(x, y), \quad (2)
where B_\sigma(x, y) is the Gaussian-smoothed base map, * denotes two-dimensional convolution, and \sigma is the standard deviation of the Gaussian kernel. The standard deviation of the Gaussian kernel is set to σ = 2, which corresponds to a moderate spatial smoothing scale under the adopted grid resolution. This choice allows the generated terrain map to preserve continuous undulating characteristics while avoiding excessive high-frequency noise or over-smoothing effects. Larger values of C_\sigma(x, y) indicate more complex terrain and higher accessibility cost. The resulting terrain cost map is shown in Figure 1.
As shown, the terrain cost map generated based on Gaussian-smoothed noise captures representative continuous and uneven geomorphological characteristics of desert–Gobi–wasteland regions. Colors ranging from green to brown to black indicate increasing terrain complexity. Green regions represent relatively flat and stable surfaces with lower construction and maintenance cost; brown regions correspond to hilly or dune-dominated areas with moderate complexity; and black regions indicate rugged or loose-soil regions associated with higher deployment difficulty.
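To make the construction above concrete, a minimal sketch of the terrain cost map generation is given below, assuming a NumPy/SciPy implementation; the grid size, random seed, and function name are illustrative choices rather than the authors' original code.

import numpy as np
from scipy.ndimage import gaussian_filter

def terrain_cost_map(grid_size=100, sigma=2.0, seed=0):
    # Generate a random noise field n(x, y) over the discretized study region.
    rng = np.random.default_rng(seed)
    noise = rng.random((grid_size, grid_size))
    # Convolve with a Gaussian kernel of standard deviation sigma: G_sigma * n(x, y).
    smoothed = gaussian_filter(noise, sigma=sigma)
    # Min-max normalization yields the Gaussian-smoothed base map B_sigma(x, y), Eq. (1).
    b_sigma = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min())
    # The terrain cost map C_sigma(x, y) is taken from the normalized map, Eq. (2).
    return b_sigma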

3.1.2. Solar Radiation Map

Solar radiation intensity is an important factor influencing the energy supply capability of monitoring nodes, and its spatial distribution generally exhibits smooth variation and terrain dependency. Terrain undulation can introduce shading and occlusion effects, which tend to reduce effective irradiance in elevated or steep-slope areas. To account for this influence in an engineering simulation framework, a terrain-related constraint term is introduced based on the Gaussian-smoothed normalized map  B σ ( x , y ) , thereby forming the solar radiation distribution model, as expressed in Equation (3).
R_\sigma(x, y) = B_\sigma(x, y)\left(1 - \alpha\, C_\sigma(x, y)\right), \quad \alpha \in [0, 1], \quad (3)
where  α is the terrain-shading coefficient, set to 0.5. The resulting solar radiation distribution is shown in Figure 2.
As shown, the solar radiation map R(x,y), generated by combining Gaussian-smoothed noise with terrain-related constraints, represents the spatial variation in solar radiation potential in desert–Gobi–wasteland regions within the simulation framework. Radiation intensity varies smoothly across space and exhibits clear regional differences. Brighter regions correspond to areas with relatively sufficient solar exposure and higher energy harvesting potential, whereas darker brown and black regions indicate low-irradiance areas influenced by terrain shading and occlusion.
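Continuing the sketch above, Equation (3) can be evaluated directly on the normalized base map and terrain cost map; the function name and the default shading coefficient follow the text, and everything else is an illustrative assumption.

import numpy as np

def solar_radiation_map(b_sigma: np.ndarray, c_sigma: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # R_sigma(x, y) = B_sigma(x, y) * (1 - alpha * C_sigma(x, y)), alpha in [0, 1], Eq. (3).
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return b_sigma * (1.0 - alpha * c_sigma)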

3.1.3. Wind Speed Intensity Map

Wind speed and wind–sand intensity are important factors influencing the monitoring value and signal interference of nodes in desert–Gobi–wasteland regions. To characterize their spatial variation within an engineering simulation framework, a dominant wind direction is assumed along the diagonal of the study region, thereby introducing a global gradient across the wind field. In addition, local perturbations associated with terrain roughness are incorporated. Based on the Gaussian-smoothed normalized map  B σ ( x , y ) , a directional gradient term and a terrain-related modification term are introduced to represent the spatial variation in wind speed. The resulting model is expressed in Equation (4).
W_\sigma(x, y) = \mathrm{Norm}\!\left(B_\sigma(x, y) + \gamma\, \frac{x + y}{2N}\right)\left(1 - \beta\, C_\sigma(x, y)\right), \quad (4)
where γ is the gradient coefficient of the dominant wind direction (set to 0.4), β is the terrain attenuation coefficient of wind speed (set to 0.4), and N is the number of coordinate points.
The resulting wind speed intensity map is shown in Figure 3.
As illustrated, the wind speed intensity map exhibits a non-uniform spatial distribution, forming a continuous gradient from relatively low to high wind-speed regions within the simulation framework. Blue regions represent low-wind zones characterized by relatively stable airflow; green to yellow regions correspond to moderate wind-speed zones with more active airflow; and red regions indicate high-wind zones associated with greater wind energy potential and higher monitoring value.
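A corresponding sketch of Equation (4) is shown below. The text does not specify whether N counts coordinate points per axis or in total, so the per-axis interpretation used here is an assumption, as is the min-max form of Norm(·).

import numpy as np

def wind_speed_map(b_sigma, c_sigma, gamma=0.4, beta=0.4):
    # Assumption: N is the number of coordinate points along one axis of the grid.
    n = b_sigma.shape[0]
    # Directional gradient along the grid diagonal models the dominant wind direction.
    y_idx, x_idx = np.indices(b_sigma.shape)
    raw = b_sigma + gamma * (x_idx + y_idx) / (2.0 * n)
    # Norm(.) is taken here as min-max normalization of the gradient-augmented field.
    normed = (raw - raw.min()) / (raw.max() - raw.min())
    # Terrain-related attenuation reduces wind speed over complex terrain, Eq. (4).
    return normed * (1.0 - beta * c_sigma)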

3.2. Simulation Region and Node Sensing Model

In the complex environment of desert–Gobi–wasteland regions, the deployment of monitoring nodes directly influences the overall coverage performance and observation quality of the monitoring network. To construct a reasonable node sensing model within the simulation framework, target points are placed at 1 km intervals across the study region, forming a discretized environmental grid. Monitoring nodes are deployed at integer-coordinate locations, and each node is assigned a sensing radius of 1 km to provide effective coverage of representative environmental information areas.
To ensure monitoring reliability, a discretized disk-based sensing model is adopted to characterize the sensing behavior of nodes. In this model, each node is assumed to have a circular sensing area with a fixed radius, within which target points can be fully detected, while detection outside this area is not considered. Let node i and target point j be given; the corresponding sensing model is defined in Equations (5) and (6).
d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \quad (5)
P_{i,j} = \begin{cases} 1, & d_{i,j} \le d \\ 0, & d_{i,j} > d \end{cases}, \quad (6)
where d_{i,j} is the Euclidean distance between node i and target point j, and d is the sensing radius.
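A minimal sketch of the discretized disk sensing model in Equations (5) and (6) is given below; node and target coordinates are assumed to be expressed in kilometres on the discretized grid.

import numpy as np

def senses(node_xy, target_xy, d=1.0):
    # Euclidean distance between node i and target point j, Eq. (5).
    dist = np.hypot(node_xy[0] - target_xy[0], node_xy[1] - target_xy[1])
    # Binary detection indicator P_{i,j}, Eq. (6): 1 inside the sensing radius d, else 0.
    return 1 if dist <= d else 0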

3.3. Problem Description

In the layout optimization problem for monitoring networks in desert–Gobi–wasteland regions, node deployment is jointly affected by terrain conditions, solar radiation, wind speed, and coverage performance. The previously constructed terrain cost map, solar radiation map, and wind speed map provide spatial information on deployment cost, energy availability, and observation value, respectively. In addition, the designed node sensing model characterizes the interaction relationships among monitoring nodes.
For a given target point, multiple monitoring nodes may detect it simultaneously. The number of detections for target point j is denoted as c_j. Let N denote the number of target points within the sensing range of a monitoring node. The average sensing count over all covered target points is then defined in Equation (7).
C_i = \frac{1}{N} \sum_{n=1}^{N} c_n, \quad (7)
Based on Equation (7), a sensing-overlap penalty term is introduced to reduce redundant sensing of the same target point by multiple monitoring nodes. Using the above formulation, the overall utility of a monitoring node is defined as shown in Equation (8).
V_i = w_1\left(1 - C_\sigma(x_i, y_i)\right) + w_2\, R_\sigma(x_i, y_i) + w_3\, W_\sigma\!\left(x_i^N, y_i^N\right) + w_4 / C_i, \quad (8)
where W_\sigma(x_i^N, y_i^N) denotes the meteorological observation value of the N target points within the sensing range of node i, and w_k (k = 1, 2, 3, 4) are weighting coefficients. For a single monitoring node, the optimization objective can be expressed as Equation (9).
\max V_i \;\Rightarrow\; \max \sum_{i=1}^{I} V_i, \quad (9)
By introducing the sensing-overlap penalty into the utility function, each monitoring node is encouraged to optimize its local utility while implicitly accounting for the global layout configuration. This formulation promotes coordinated behavior among multiple nodes by reducing redundant sensing, thereby alleviating excessive coupling effects during multi-agent optimization. As a result, the proposed utility design contributes to improved training stability and computational efficiency in practice, without requiring explicit centralized control.
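The sketch below illustrates how the per-node utility of Equation (8) and the global objective of Equation (9) could be evaluated on the simulated maps. The aggregation of the wind-map term over the covered targets (taken here as a mean), the placeholder weight values, and the [row, column] map indexing are assumptions for illustration.

import numpy as np

def node_utility(node, targets, C, R, W, weights=(0.25, 0.25, 0.25, 0.25), d=1.0):
    x, y = node
    # Target points covered by node i; each entry is (tx, ty, c_j), where c_j is the
    # number of nodes currently sensing target j.
    covered = [(tx, ty, cj) for tx, ty, cj in targets if np.hypot(x - tx, y - ty) <= d]
    if not covered:
        return 0.0
    c_i = np.mean([cj for _, _, cj in covered])            # average sensing count, Eq. (7)
    w_obs = np.mean([W[ty, tx] for tx, ty, _ in covered])  # observation value of covered targets (assumed aggregation)
    w1, w2, w3, w4 = weights                               # illustrative placeholder weights w1..w4
    # Utility V_i combines deployment cost, energy potential, observation value and
    # a sensing-overlap penalty, Eq. (8).
    return w1 * (1.0 - C[y, x]) + w2 * R[y, x] + w3 * w_obs + w4 / c_i

def global_utility(nodes, targets, C, R, W, **kwargs):
    # Global objective: sum of per-node utilities over all I nodes, Eq. (9).
    return sum(node_utility(n, targets, C, R, W, **kwargs) for n in nodes)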

4. Algorithm Design and Implementation

4.1. Overview of the PPO Algorithm

Proximal Policy Optimization (PPO) [25] is a policy-gradient-based reinforcement learning algorithm built on the Actor–Critic (A–C) architecture. During training, PPO performs multiple mini-batch gradient updates within each iteration, which improves sample efficiency and promotes stable convergence. The key idea of PPO is the introduction of a clipping mechanism in the objective function to constrain the update magnitude between the new and old policies. This mechanism prevents excessively large policy updates that could destabilize the training process. By balancing policy improvement with update constraints, PPO achieves stable and reliable policy optimization in practice.
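For reference, the clipped surrogate objective of PPO [25] that underlies this mechanism can be written as
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
where \hat{A}_t is the estimated advantage and \epsilon is the clipping parameter that bounds how far the new policy may deviate from the old one within a single update.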
Compared with traditional policy-gradient algorithms, PPO features a relatively simple structure, stable training behavior, and straightforward hyperparameter tuning. Due to these advantages, PPO has been widely adopted in various reinforcement learning applications. Motivated by its stability and adaptability, this study develops an Environment-Aware Proximal Policy Optimization (EA-PPO) algorithm to enable efficient learning and optimization of monitoring node deployment strategies under complex environmental conditions.

4.2. Environment-Aware Layout Optimization Algorithm: EA-PPO

4.2.1. MDP Modeling

The deep reinforcement learning process is formulated as a Markov Decision Process (MDP), represented by the tuple ⟨S, A, P, R, γ⟩, where S denotes the state space, A the action space, P the state transition probability, R the reward function, and γ the discount factor. In practical algorithm design, the specification of the state representation, action space, and reward mechanism plays a critical role in enabling effective learning under complex and coupled environments. In the context of monitoring network layout optimization, this modeling process presents inherent challenges. On the one hand, each agent needs to incorporate sufficient information reflecting global layout tendencies to make reasonable deployment decisions; on the other hand, an excessively rich state representation may significantly increase the dimensionality of the state space, thereby negatively affecting training stability and convergence. Guided by the principles of information sufficiency and model trainability, this study proposes a compact and structured design for the state, action, and reward components, which are detailed as follows.
State Space:
S = [n_u, n_d, n_l, n_r, v_c, v_u, v_d, v_l, v_r]
The state space represents the local perception of an individual monitoring node within the deployment environment. Here, n_u, n_d, n_l, n_r denote the numbers of detected target points within the sensing radius in the upward, downward, leftward, and rightward directions, respectively, reflecting the directional coverage density around the node. The term v_c denotes the value coefficient at the node's current position, and v_u, v_d, v_l, v_r correspond to the associated directional value coefficients derived from Equation (8).
To define directional perception, the circular sensing region of each node is partitioned into four overlapping semicircular regions corresponding to the upward, downward, leftward, and rightward directions. Specifically, the “upward” region refers to the upper semicircle relative to the node’s position, while the “downward”, “leftward”, and “rightward” regions are defined analogously. A target point may therefore contribute to multiple directional counts if it lies near the boundary between regions. This design provides coarse directional imbalance information while avoiding fine-grained angular discretization.
This compact state representation enables each agent to infer aggregated spatial imbalance tendencies from local observations, allowing decentralized decision-making without requiring explicit access to the full global state or direct inter-agent communication. It should be noted that this aggregated information does not constitute explicit global state input, and the environment remains coupled due to interactions among multiple agents. Although the directional partition is coarse, it is sufficient for incremental layout adjustment, as the optimization objective focuses on reducing large-scale coverage imbalance rather than fine-grained geometric optimization. The overall design balances information sufficiency and state-space compactness, thereby supporting effective learning while maintaining training stability.
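A minimal sketch of this state construction is shown below. The paper derives the directional value coefficients from Equation (8); for illustration they are approximated here by mean wind-map values over each semicircular region, v_c is taken as the value at the node's current cell, and a Cartesian y-up convention is assumed, all of which are assumptions rather than the authors' implementation.

import numpy as np

def directional_state(node, targets, W, d=1.0):
    # Build S = [n_u, n_d, n_l, n_r, v_c, v_u, v_d, v_l, v_r] from local observations,
    # using four overlapping semicircular regions around the node (Cartesian y-up).
    x, y = node
    counts = {"u": 0, "d": 0, "l": 0, "r": 0}
    values = {"u": [], "d": [], "l": [], "r": []}
    for tx, ty in targets:
        if np.hypot(tx - x, ty - y) > d:
            continue
        # A target on a region boundary contributes to both adjacent semicircles.
        if ty >= y:
            counts["u"] += 1; values["u"].append(W[ty, tx])
        if ty <= y:
            counts["d"] += 1; values["d"].append(W[ty, tx])
        if tx <= x:
            counts["l"] += 1; values["l"].append(W[ty, tx])
        if tx >= x:
            counts["r"] += 1; values["r"].append(W[ty, tx])
    v = {k: (float(np.mean(vs)) if vs else 0.0) for k, vs in values.items()}
    v_c = float(W[y, x])  # value coefficient at the node's current cell (assumed definition)
    return np.array([counts["u"], counts["d"], counts["l"], counts["r"],
                     v_c, v["u"], v["d"], v["l"], v["r"]], dtype=np.float32)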
Action Space:
A = [a_0, a_1, a_2, a_3, a_4]
The action space defines the set of feasible spatial movements of a monitoring node. At each decision step, the agent can choose to remain at its current location or move one discrete step in one of the four cardinal directions (upward, downward, leftward, or rightward) within the discretized deployment region.
Reward Mechanism:
R = \begin{cases} 1, & v_i^t > v_i^{t-1} \\ 0, & v_i^t = v_i^{t-1} \\ -1, & v_i^t < v_i^{t-1} \end{cases},
The reward mechanism is designed to provide robust utility-guided feedback for decentralized learning. Instead of directly using the magnitude of utility variation as a continuous reward, a ternary reward scheme is adopted to indicate whether the global utility increases, remains unchanged, or decreases after an action. This design focuses on the direction of improvement rather than the exact utility difference. In practice, incremental movements often result in small and noisy utility changes, especially in later optimization stages. The ternary reward helps suppress noise amplification and provides a stable learning signal, which is beneficial for training stability in coupled multi-agent environments.
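In code, this ternary reward reduces to a sign comparison of consecutive utility values, as in the short sketch below.

def ternary_reward(v_t, v_prev):
    # +1 if the utility increased after the action, 0 if unchanged, -1 if it decreased.
    if v_t > v_prev:
        return 1
    if v_t < v_prev:
        return -1
    return 0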
Through the joint design of environmental features, node actions, and directional utility-guided reward feedback, the proposed MDP formulation supports stable policy learning based on local observations, facilitating coordinated layout improvement in coupled multi-agent deployment environments.

4.2.2. EA-PPO Algorithm Description

The Environment-Aware Proximal Policy Optimization (EA-PPO) algorithm introduces a state modeling and value evaluation mechanism based on local environmental features. By leveraging compact, environment-related state representations, each monitoring node can make decentralized decisions with limited local observations, without requiring explicit global state information or intensive inter-node communication. This design helps control the growth of state dimensionality in multi-agent layout optimization problems and improves training stability and computational efficiency in complex environments. The overall training process of the algorithm is shown in Algorithm 1.
Algorithm 1. EA-PPO
Input: Terrain cost map, solar radiation map, and wind-speed map of the desert–Gobi–wasteland regions; maximum training iterations; total time steps T
Output: Trained policy network
Initialization: Initialize the policy network π and value network V corresponding to the agent; initialize replay buffer D
1: while k ≤ iteration do
2:   for t = 1 to T do
3:     Obtain the initial state s_t^i
4:     Obtain action a_t^i from policy network π given s_t^i
5:     Execute action a_t^i and obtain reward r_t^i
6:     Store (s_t^i, a_t^i, r_t^i) into replay buffer D
7:   End for
8:   Update policy network π and value network V using data from replay buffer D
9: End while
10: return the trained policy network π
In Algorithm 1, Steps 2–6 correspond to data collection. In Steps 3–4, the agent obtains an action a_t^i based on the state s_t^i and the policy network π. Steps 5–6 compute the reward r_t^i using global environmental information, and the collected data are stored in the replay buffer D. In Step 8, the policy network π and value network V are updated using samples from buffer D. Finally, Step 10 outputs the fully trained policy network π.
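A framework-agnostic sketch of this training loop is given below; env, policy, value_fn, and ppo_update are hypothetical stand-ins for components whose implementation details are not specified in the paper.

def train_ea_ppo(env, policy, value_fn, ppo_update, iterations=100, T=50):
    # Outer loop over training iterations (Step 1).
    for _ in range(iterations):
        buffer = []                                 # replay buffer D
        state = env.reset()                         # initial state s_t (Step 3)
        # Data collection over T time steps (Steps 2-6).
        for _ in range(T):
            action = policy(state)                  # sample a_t from pi(.|s_t) (Step 4)
            next_state, reward = env.step(action)   # execute a_t, obtain r_t (Step 5)
            buffer.append((state, action, reward))  # store transition in D (Step 6)
            state = next_state
        # Clipped-surrogate update of the policy and value networks (Step 8).
        ppo_update(policy, value_fn, buffer)
    return policy                                   # trained policy network (Step 10)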

4.3. Global Layout Optimization Algorithm Based on EA-PPO (GLOAE)

After completing the design and training of the EA-PPO algorithm, this study further extends it to a multi-agent cooperative decision-making framework, resulting in the Global Layout Optimization Algorithm based on EA-PPO (GLOAE). In this framework, the decision-making strategy learned by individual agents is deployed in a multi-agent setting, where multiple monitoring nodes interact with a shared environment and iteratively adjust their positions. At each decision step, all agents execute their policies in parallel and update their positions synchronously. After the layout update, a global utility function defined by the environment and deployment objective is evaluated based on the updated node configuration. Each agent then receives a reward derived from this global utility change, enabling coordination through a shared objective while preserving fully decentralized policy execution. Through coordinated policy execution and environment-level utility-guided feedback, GLOAE supports collaborative layout improvement under dynamic environmental conditions, thereby enhancing the overall coordination and utility of the monitoring network. The detailed algorithmic procedure of GLOAE is summarized in Algorithm 2.
Algorithm 2. GLOAE
Input: Terrain cost map, solar radiation map, and wind-speed map of the desert–Gobi–wasteland regions; EA-PPO algorithm; maximum training iterations; total runtime T
Output: Final positions of all monitoring nodes
Initialization: Initialize the initial positions of all monitoring nodes; initialize the policy network π and value network V for each agent; initialize replay buffer D
1: while k ≤ iteration do
2:   for t ≤ T do
3:     for each agent s ∈ S do
4:       Perform decision-making for each agent individually using EA-PPO
5:       Update the positions of all agents synchronously
6:       Compute the reward of each agent based on the global utility function evaluated from the updated node layout
7:       Each agent stores (s_t^i, a_t^i, r_t^i) into replay buffer D
8:     End for
9:   End for
10:  Update the policy network π and value network V of each agent using EA-PPO
11: End while
12: return the final node layout map
The above training process is built upon the EA-PPO framework and is executed in a dynamic environment where multiple monitoring nodes interact with the shared deployment space. During training, all nodes update their positions synchronously, which introduces dynamic changes in the environment as the layout evolves over time. Under this setting, agents are trained within a multi-agent deployment scenario using the same decision-making framework, and their interactions are implicitly reflected through the shared environment and utility feedback. After training is completed, the resulting node configuration represents the final layout obtained by the proposed method.
In the GLOAE framework, global coordination is achieved without introducing centralized critics or explicit global state sharing. Each monitoring node maintains its own policy and value networks and makes decisions based solely on local observations. The coupling among agents is implicitly realized through a shared utility formulation and synchronized environment interaction, allowing decentralized agents to collectively optimize the global layout objective while preserving scalability and computational efficiency.
The proposed framework avoids joint action enumeration by decoupling global optimization into local utility-driven decisions. Therefore, the effective state and action dimensions of each agent remain constant, and the overall computational cost scales approximately linearly with the number of nodes.
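The decentralized coordination described above can be summarized by the following sketch of a single GLOAE decision step; observe, step_position, global_utility, and the per-agent policies are hypothetical placeholders consistent with Algorithm 2 rather than the authors' implementation.

def gloae_step(positions, policies, observe, step_position, global_utility):
    # Each agent selects an action from its own policy using only local observations.
    actions = [policies[i](observe(i, positions)) for i in range(len(positions))]
    v_prev = global_utility(positions)
    # All agent positions are updated synchronously.
    new_positions = [step_position(p, a) for p, a in zip(positions, actions)]
    v_new = global_utility(new_positions)
    # Shared utility feedback: every agent receives the same ternary reward signal.
    reward = 1 if v_new > v_prev else (-1 if v_new < v_prev else 0)
    transitions = [(observe(i, positions), actions[i], reward)
                   for i in range(len(positions))]
    return new_positions, transitions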

5. Experimental Results and Analysis

This section presents a series of simulation experiments conducted to evaluate the performance of the proposed algorithms. The experimental analysis is organized into two parts. The first part examines the determination of the number of monitoring nodes and provides a visualization of the resulting deployment layouts. The second part analyzes the convergence behavior and performance characteristics of the EA-PPO and GLOAE algorithms.

5.1. Simulation Environment and Parameters

All simulation experiments in this section are implemented in Python 3.9 and executed on a desktop computer equipped with an AMD Ryzen 5 2600 CPU and 16 GB RAM. The main simulation parameters are summarized in Table 1.
The weighting coefficients w_1–w_4 are introduced to balance the contributions of coverage performance, environmental benefit, deployment cost, and sensing redundancy in the utility function. These weights are selected based on engineering considerations and empirical testing to ensure that no single component dominates the optimization process. After normalization of each utility term, the adopted weight values represent a balanced trade-off among different objectives. Empirical observations indicate that moderate variations in these weights primarily affect convergence speed, while having limited influence on the final layout pattern.
The performance of reinforcement learning–based algorithms is generally influenced by hyperparameter selection, including learning rates and weighting coefficients. In this study, these parameters are chosen based on empirical experience and commonly used settings reported in related literature. Preliminary experiments show that moderate changes in learning rates mainly affect convergence speed and have little impact on the final converged utility. Similarly, reasonable variations in weighting coefficients do not lead to qualitative changes in layout behavior, indicating a certain degree of robustness of the proposed framework under the considered simulation settings. Both the policy and value networks adopt lightweight multilayer perceptron architectures with four hidden layers of 128 neurons and ReLU activation functions. All agents share the same network architecture, while their network parameters are trained independently. Under the experimental settings, a complete training process typically finishes within several minutes on a standard desktop computer.

5.2. Determination of the Number of Nodes and Result Visualization

First, the variation of overall utility under different numbers of monitoring nodes is analyzed to examine the influence of node scale on layout performance and to select an appropriate node configuration for subsequent experiments. The number of nodes is set to 10, 20, 30, and 40, respectively, and the corresponding results are shown in Figure 4. As illustrated in Figure 4, when the number of nodes is 10, the relatively small deployment scale leads to limited coverage redundancy. As a result, the overall utility exhibits stable behavior and converges rapidly, reaching convergence after approximately 8 training iterations. In this case, the initial utility is around 700 and increases to a converged value of approximately 900. When the number of nodes increases to 20, the initial utility rises to about 950, and the utility curve stabilizes after approximately 10–20 training iterations, converging to around 1600. The convergence process remains smooth, and the overall stability is comparable to that observed in the 10-node scenario. When the number of nodes further increases to 30 and 40, higher node density introduces more redundant coverage, which negatively affects the overall utility. Specifically, in the 30-node setting, the initial utility is approximately 650 and converges to around 1300 after about 20 training iterations, accompanied by increased fluctuations. In the 40-node case, the initial utility drops to roughly 450 and converges to about 800 after approximately 30 iterations, showing the most pronounced oscillations among all configurations. Overall, the experimental results indicate that a deployment with 20 monitoring nodes provides a favorable trade-off between convergence speed and achieved utility in the considered simulation setting. Moreover, across different node quantities, the algorithm consistently demonstrates convergent behavior, suggesting stable performance under varying deployment scales.
To further examine the layout characteristics of the proposed algorithm, a visualization analysis is conducted for the deployment results when the number of monitoring nodes is set to 20. During the visualization process, the previously constructed terrain cost map, solar radiation map, and wind speed intensity map are integrated, and the region is discretized into pixel-based grid blocks to provide an intuitive representation of node distribution. The corresponding deployment results are shown in Figure 5. As illustrated in Figure 5, the deployed monitoring nodes are primarily distributed in regions with relatively high observation value, while only limited local overlaps are observed in a few high-value areas. This spatial distribution reflects a reasonable trade-off between coverage and redundancy under the given environmental conditions. In regions with lower observation value, nodes are more sparsely distributed, and redundant coverage is largely avoided. Overall, the visualization results indicate that the generated layout achieves an organized spatial distribution while maintaining a high level of monitoring utility in the considered scenario.

5.3. Convergence Characteristics and Performance Analysis of the Algorithm

To further evaluate the performance characteristics of the proposed algorithm, comparative experiments are conducted using several representative reinforcement learning baselines. It is well recognized that the performance of deep reinforcement learning methods can be influenced by differences in implementation structures and architectural design. Therefore, to examine the performance impact introduced by the EA-PPO framework, multiple PPO-based implementations are selected for comparison. The baseline algorithms are denoted as PPO-1 [26], PPO-2 [27], and PPO-3 [28]. Among them, PPO-1 and PPO-2 represent commonly adopted PPO architectures in reinforcement learning applications, while PPO-3 incorporates structural modifications designed for multi-user or interactive environments, enabling more flexible policy adaptation under dynamic conditions. All algorithms are evaluated under identical environmental settings, and the evolution of the overall utility with respect to training iterations is recorded.
It should be noted that PPO-1, PPO-2, and PPO-3 represent different implementation variants of the standard PPO framework rather than distinct algorithms. PPO-1 corresponds to a classical PPO implementation, where agents make decisions based solely on local state information and learn from rewards that do not explicitly account for the influence of other agents. PPO-2 extends PPO-1 by adopting a global reward formulation while still relying on local state information, allowing agents to indirectly reflect overall layout performance during learning. PPO-3 further enhances the state representation by incorporating locally observable environmental information in addition to local states, improving agents’ perception of dynamic multi-agent environments, while its reward design still does not explicitly model inter-agent interactions. These PPO variants are designed to illustrate the performance differences arising from incremental structural modifications within the PPO family, rather than to serve as exhaustive benchmarks against all state-of-the-art MARL algorithms.
Figure 6 compares the convergence behavior of different PPO-based implementations in the multi-node layout optimization task. Under the same experimental settings, the baseline PPO variants exhibit distinct training dynamics. PPO-1 and PPO-2 show noticeable fluctuations throughout the training process, and their utility curves do not display a clear stabilization trend within the considered number of iterations. PPO-3, which incorporates architectural modifications for interactive environments, demonstrates improved convergence behavior and reaches a relatively stable state after approximately 17 training iterations. However, its utility curve still exhibits pronounced oscillations during both the early training stage and the post-convergence phase.
In contrast, the proposed EA-PPO algorithm exhibits a smoother utility evolution throughout training. The utility increases steadily during the early stage and remains relatively stable in the later phase, converging to a higher utility value under the same conditions. These results indicate that EA-PPO achieves improved training stability and more consistent performance compared with the selected PPO-based baselines in the considered layout optimization scenario.
To further evaluate the performance of the global layout optimization framework GLOAE, additional comparisons are conducted with three representative optimization methods: Genetic Algorithm (GA) [29], Greedy Algorithm [30], and Particle Swarm Optimization (PSO) [31]. For reproducibility and fair comparison, the key parameter settings of the baseline algorithms are summarized in Table 2. These parameters are selected based on commonly adopted values in the literature and preliminary empirical testing to ensure stable convergence. The corresponding results are shown in Figure 7. As illustrated, GA exhibits a relatively slow convergence process, requiring approximately 20–30 iterations to reach a stable utility level, while maintaining a smooth convergence trajectory with a final value of around 1400. The Greedy algorithm converges rapidly in the early stage and reaches a stable state at approximately the 7th iteration; however, its final utility level remains comparatively low, with a converged value of about 1200. PSO demonstrates rapid initial improvement followed by noticeable oscillations in the mid-stage, and gradually stabilizes in later iterations, achieving a final utility level of approximately 1400.
Under the same experimental conditions, GLOAE converges faster and achieves a higher final utility value than the selected baseline methods. From an empirical perspective, these results indicate that GLOAE provides a favorable balance between convergence behavior and achieved utility in the multi-node layout optimization task, demonstrating its suitability for collaborative layout optimization under dynamic environmental conditions.
Building upon the previous experiments, additional simulations are conducted under time-varying environmental conditions to examine the performance behavior of the algorithms when external disturbances are introduced. In practical desert–Gobi–wasteland scenarios, environmental factors such as terrain conditions, solar radiation, and wind characteristics may change over time. To reflect this variability, time-dependent perturbations are incorporated into the original simulation environment, resulting in a dynamic setting in which environmental states evolve continuously. The performance of each algorithm under these dynamic conditions is shown in Figure 8. As illustrated in Figure 8, the Greedy algorithm exhibits increased sensitivity to environmental changes, with its utility curve showing pronounced fluctuations and no clear stabilization trend under dynamic disturbances. The Genetic Algorithm demonstrates a noticeable reduction in convergence speed, requiring approximately 30 iterations to reach a relatively stable state, and its final utility level decreases compared with the static scenario, although the overall convergence trend remains smooth. The Particle Swarm Optimization algorithm shows enhanced oscillatory behavior in the presence of dynamic perturbations and gradually stabilizes after around 15 iterations, while still exhibiting considerable fluctuations in the later stages. Under the same dynamic conditions, the GLOAE maintains a comparatively stable convergence behavior. Although its convergence speed is slightly reduced compared with the static environment, the utility curve remains smooth throughout training, and the achieved utility level remains close to that observed previously. From an empirical perspective, these results suggest that GLOAE is less sensitive to time-varying environmental disturbances and can maintain consistent performance under dynamic conditions.
To more intuitively and quantitatively examine the influence of dynamic perturbations on the convergence behavior of each algorithm, the final 20 iterations before and after introducing perturbations are selected, and the mean and standard deviation during the convergence phase are computed to characterize convergence stability. The corresponding results are summarized in Table 3. The results in Table 3 are reported as mean values and standard deviations to illustrate convergence stability and performance trends under dynamic perturbations in a representative run, rather than formal statistical hypothesis testing. As shown in Table 3, the GLOAE exhibits relatively consistent convergence behavior after the introduction of dynamic perturbations. Its convergence mean decreases slightly from 1592.4 to 1588.1, while the standard deviation increases from 14.8 to 18.1, indicating a limited increase in variability under dynamic conditions. In comparison, the Greedy algorithm shows larger fluctuations, with the convergence mean increasing from 1238.6 to 1261.3 and the standard deviation rising from 58.8 to 70.2, suggesting increased sensitivity to environmental changes. The GA maintains a relatively stable convergence trend, with its convergence mean decreasing from 1420.9 to 1353.9 and the standard deviation increasing from 16.9 to 20.3, reflecting moderate sensitivity to dynamic perturbations. The PSO algorithm demonstrates the largest variation among the compared methods, with its convergence mean dropping from 1438.2 to 1332.5 and the standard deviation increasing from 36.2 to 55.6. Overall, the quantitative results indicate that GLOAE maintains comparatively smaller fluctuations during the convergence phase under dynamic perturbations.
While Table 3 illustrates convergence behavior before and after perturbations in a representative run, a single-run analysis alone cannot fully capture robustness under stochastic disturbances. To further statistically validate the comparative performance of different algorithms under dynamic perturbations, each method was independently executed 10 times with different random seeds. Table 4 reports the mean and standard deviation of the final performance obtained after introducing perturbations. A non-parametric Friedman test reveals statistically significant differences among the compared algorithms (χ2 = 23.88, p < 0.001). Furthermore, post hoc Wilcoxon signed-rank tests confirm that GLOAE significantly outperforms GREED, GA, and PSO (all p < 0.01).
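For reproducibility, tests of this kind can be carried out with standard SciPy routines as sketched below; the arrays are synthetic placeholders drawn for illustration and are not the paper's experimental data.

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# Synthetic placeholder samples: final utility of 10 independent runs per method.
gloae = rng.normal(loc=1590, scale=18, size=10)
greedy = rng.normal(loc=1260, scale=70, size=10)
ga = rng.normal(loc=1355, scale=20, size=10)
pso = rng.normal(loc=1330, scale=55, size=10)

# Friedman test for overall differences among the four algorithms.
stat, p_value = friedmanchisquare(gloae, greedy, ga, pso)
print(f"Friedman chi2 = {stat:.2f}, p = {p_value:.4f}")

# Post hoc pairwise Wilcoxon signed-rank tests against GLOAE.
for name, sample in [("Greedy", greedy), ("GA", ga), ("PSO", pso)]:
    w_stat, w_p = wilcoxon(gloae, sample)
    print(f"GLOAE vs {name}: W = {w_stat:.1f}, p = {w_p:.4f}")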
Based on the above experimental results, including multiple independent runs and statistical significance tests, the proposed GLOAE demonstrates statistically validated and more consistent performance under dynamic environmental conditions compared with the selected baseline methods. In dynamically changing scenarios with multi-source perturbations, GLOAE is able to maintain stable convergence behavior and achieve higher utility levels with reduced performance variability. These results indicate that the proposed approach is well suited for layout optimization problems involving time-varying environmental characteristics.

6. Conclusions

This study systematically investigates the layout optimization problem of meteorological monitoring networks in the complex and highly variable environments of desert–Gobi–wasteland regions. A multidimensional simulation environment integrating terrain conditions, solar radiation distribution, and wind speed characteristics is constructed to provide a controlled and representative platform for algorithm evaluation. Based on this environment, an Environment-Aware Proximal Policy Optimization (EA-PPO) algorithm is developed to support layout decision-making under complex environmental constraints. By adopting compact environment-related state representations and utility-guided feedback, EA-PPO improves training stability and learning efficiency in monitoring node deployment tasks. Furthermore, a Global Layout Optimization Algorithm based on EA-PPO (GLOAE) is proposed to enable coordinated layout optimization among multiple monitoring nodes through interaction with a shared environment. Extensive simulation experiments are conducted to evaluate the performance of the proposed methods under both static and dynamically changing environmental conditions. The results demonstrate that EA-PPO and GLOAE achieve competitive utility levels and exhibit stable convergence behavior compared with the selected baseline methods. In dynamic scenarios with time-varying perturbations, the proposed framework maintains consistent performance across different environmental settings.
Overall, this work provides an effective simulation-based framework and an engineering-oriented optimization approach for monitoring network layout design in desert–Gobi–wasteland environments. The findings offer practical insights that may facilitate the development of intelligent and robust deployment strategies for meteorological monitoring networks operating in complex natural conditions.

Author Contributions

Conceptualization, methodology, Z.H.; software, validation, Q.L., R.L. and Z.X.; formal analysis, Z.H. and R.L.; investigation, Q.L. and Z.X.; resources, Z.H.; data curation, Q.L., R.L. and Z.X.; writing—original draft preparation, Z.H. and R.L.; writing—review and editing, J.H.; visualization, Q.L.; supervision, J.H.; project administration, Z.H.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Major Science and Technology Project of Gansu Province, grant number 24ZDGA003. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We would like to express our gratitude to the editors and reviewers for their valuable assistance and constructive feedback in the review and publication of this article.

Conflicts of Interest

Author Runxiang Li was employed by the company Lanzhou Dafang Electronics Co., Ltd. All of the authors, Zifen Han, QingQuan Lv, Zhihua Xie, Runxiang Li, and Jiuyuan Huo, declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

Figure 1. Terrain Cost Map.
Figure 2. Solar Radiation Map.
Figure 3. Wind Speed Intensity Map.
Figure 4. Convergence Results under Different Numbers of Nodes.
Figure 5. Node layout results (nodes = 20).
Figure 6. Experimental results of different PPO design schemes.
Figure 7. Comparison of overall optimization performance among different algorithms.
Figure 8. Performance comparison of algorithms under dynamic environmental conditions.
Table 1. Parameters of experimental environment.
Parameter          Value
w1                 0.15
w2                 0.25
w3                 0.2
w4                 0.4
epochs             50
T                  100
optimizer          Adam
learning rate      0.01
discount factor    0.98
clipping value     0.2
batch size         2000
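For reference, the PPO-related entries in Table 1 (clipping value 0.2, discount factor 0.98, Adam with learning rate 0.01) correspond to the standard clipped-surrogate formulation. The sketch below is a generic illustration of how these hyperparameters enter the loss and return computation; it is not the EA-PPO implementation, and the helper names are hypothetical.

```python
import torch

# Hyperparameters taken from Table 1.
CLIP_EPS = 0.2        # clipping value
GAMMA = 0.98          # discount factor
LEARNING_RATE = 0.01  # Adam learning rate

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages):
    """Clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages
    return -torch.min(unclipped, clipped).mean()

def discounted_returns(rewards):
    """Monte-Carlo returns using the discount factor from Table 1."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.insert(0, g)
    return torch.tensor(returns)

# Usage with dummy tensors; in a full training loop an actor network and
# torch.optim.Adam(actor.parameters(), lr=LEARNING_RATE) would be attached.
new_lp = torch.randn(5, requires_grad=True)
old_lp = new_lp.detach() + 0.1 * torch.randn(5)
adv = torch.randn(5)
loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()
print(discounted_returns([1.0, 0.5, 2.0]))
```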
Table 2. Key parameter settings of comparison algorithms.
Algorithm   Parameter               Value
GA          Population size         50
            Crossover rate          0.8
            Mutation rate           0.05
PSO         Swarm size              50
            Cognitive coefficient   2.0
            Social coefficient      2.0
            Inertia weight          0.7
Greedy      Strategy                Deterministic
Table 3. Convergence Performance of Different Algorithms Before and After Adding Perturbations.
Algorithm   Convergence Mean (Before Perturbation)   Standard Deviation   Convergence Mean (After Perturbation)   Standard Deviation
GLOAE       1592.4                                   14.8                 1588.1                                  18.1
GREED       1238.6                                   58.8                 1261.3                                  70.2
GA          1420.9                                   16.9                 1353.9                                  20.3
PSO         1438.2                                   36.2                 1332.5                                  55.6
Table 4. Statistical Comparison of Different Algorithms under Dynamic Perturbations.
Algorithm   Mean     Standard Deviation
GLOAE       1587.3   19.0
GREED       1245.6   123.9
GA          1358.7   41.8
PSO         1368.2   48.9