Search Results (4,701)

Search Parameters:
Keywords = rewards

31 pages, 4664 KB  
Article
Reinforcement Learning-Enhanced Botnet Defense System in Grid Topology Networks Using the SIRO Framework
by Mohd Hafizuddin Bin Kamilin, Shingo Yamaguchi and Sena Yoshioka
Sensors 2026, 26(8), 2517; https://doi.org/10.3390/s26082517 (registering DOI) - 19 Apr 2026
Abstract
Digitalizing essential services opens up a new risk of exposing critical infrastructure to botnet infections. In a grid topology network, the neighbor-to-neighbor paths can be used by a malicious botnet to spread the infection. Previous white-hat worm launchers used heuristics and supervised learning to exterminate botnets, which demand specific conditions or a suitable dataset to be effective. Although reinforcement learning addresses these issues, it requires a longer time to train. This article proposes a framework to shorten training and improve the effectiveness of reinforcement learning. The framework applies four key principles: (1) surveying the network status with multi-tensor input, (2) removing irrelevant actions via a novel Chebyshev-based masking strategy, (3) reinforcing key actions with rewards, and (4) optimizing rewards for winning. Four reinforcement learning algorithms (vanilla policy gradient, deep Q-network, proximal policy optimization, and MuZero) are implemented to evaluate the framework in a stylized grid topology network simulation. An ablation study indicates that the masking used in Identify accounts for the majority of the improvement, whereas multi-channel input in Survey alone can reduce performance without complementary masking, rewards, and optimization. With the mean winning rate improved by 49.129% and mean win efficiency improved by 118.8031% against our previous work, the framework's effectiveness is confirmed in stylized simulations. Full article
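The Chebyshev-based masking idea lends itself to a compact illustration. The sketch below is hypothetical (the paper's actual masking rule, grid dimensions, and radius are not given here): actions targeting cells beyond a Chebyshev radius of the agent's position are masked out, shrinking the effective action space.

```python
# Hypothetical sketch of Chebyshev-based action masking on a grid.
# All names, the grid size, and the radius are illustrative assumptions.

def chebyshev_distance(a, b):
    """Chebyshev (chessboard) distance between two grid cells."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def action_mask(agent, width, height, radius):
    """Return a {cell: bool} map; True means the action stays available."""
    return {
        (x, y): chebyshev_distance(agent, (x, y)) <= radius
        for x in range(width)
        for y in range(height)
    }

mask = action_mask(agent=(2, 2), width=5, height=5, radius=1)
# Only the 3x3 neighbourhood around (2, 2) remains unmasked.
allowed = [cell for cell, ok in mask.items() if ok]
```

With radius 1 on a 5x5 grid, 9 of 25 candidate actions survive, which is the kind of pruning that can speed up training by removing irrelevant actions.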
29 pages, 488 KB  
Review
Glucagon-like Peptide-1 and Dual GIP/GLP-1 Receptor Agonists in Brain: Exploring the Expanding Role and Safety in Neuropsychiatry
by Ana Cristina Tudosie, Loredana-Maria Marin, Simona Georgiana Popa and Andreea Loredana Golli
Int. J. Mol. Sci. 2026, 27(8), 3628; https://doi.org/10.3390/ijms27083628 (registering DOI) - 18 Apr 2026
Abstract
Glucagon-like peptide-1 (GLP-1) and dual GIP/GLP-1 receptor agonists, originally introduced for the management of type 2 diabetes mellitus and obesity, are increasingly recognized for their broader actions within the central nervous system, with emerging implications in neuropsychiatry and neurodegeneration. This review integrates current preclinical and clinical evidence, emphasizing their pharmacodynamic profile, central receptor distribution, and the molecular pathways linking metabolic signaling to neural function. Evidence suggests that GLP-1 receptor activation across key brain regions involved in energy balance and reward modulates multiple neurotransmitter systems, including dopamine and serotonin, as well as glutamatergic and GABAergic transmission, thereby influencing behavior, affective processes, and cognitive function. In parallel, these agents exhibit neuroprotective properties through improved neuronal insulin sensitivity, attenuation of neuroinflammatory pathways, and support of neuroplasticity, alongside effects on limiting pathological protein aggregation. Dual GIP/GLP-1 agonism may further potentiate these central actions through complementary metabolic and synaptic mechanisms. Although pharmacovigilance data have identified isolated neuropsychiatric adverse events, current clinical evidence does not support a consistent causal association. Collectively, incretin-based therapies represent a promising translational approach at the interface of metabolic and neuropsychiatric disorders, warranting further investigation into their long-term central safety, therapeutic efficacy, and clinical relevance. Full article
(This article belongs to the Special Issue Role of the Gut-Islet Axis in and Beyond Metabolic Diseases)
33 pages, 3976 KB  
Article
Threat Conditioning Prior to Cocaine or Sucrose Exposure Alters Reward-Seeking Behavior in a Sex-Dependent Manner
by Yobet Perez-Perez, Roberto J. Morales-Silva, Genesis N. Rodriguez-Torres, Rafael III Ruiz-Villalobos, Jose C. Rivera-Velez, Edgardo G. Arlequin-Torres, Elaine M. Vera-Torres, Lenin J. Godoy-Muñoz, Serena I. Fazal, Nilenid Rivera-Aviles, Sofia Neira and Marian T. Sepulveda-Orengo
Psychiatry Int. 2026, 7(2), 85; https://doi.org/10.3390/psychiatryint7020085 (registering DOI) - 18 Apr 2026
Abstract
Background/Objectives: Research has shown a high prevalence of co-occurring trauma-related disorders and cocaine use disorder (CUD). However, there remains a need for preclinical studies to determine how traumatic event exposure influences vulnerability to CUD development and relapse. In this study, we assessed the impact of traumatic event exposure using a threat conditioning (TC) paradigm, which models traumatic event exposure through associative threat learning, on cocaine-seeking behavior in adult male and female rats. Methods: Adult male and female rats were exposed to a single TC session. After TC, the rats underwent cocaine self-administration (SA), extinction training, cue-primed reinstatement, and cocaine-primed reinstatement testing. A parallel cohort underwent sucrose SA to assess whether TC altered non-drug reward seeking. Results: In the cocaine cohort, stressed male rats exhibited greater cue- and cocaine-primed reinstatement relative to non-stressed males, whereas no reinstatement differences emerged in female rats. In the sucrose cohort, stressed females displayed increased sucrose pellet delivery during self-administration compared to non-stressed females, but no differences were observed during sucrose reinstatement in either male or female rats. Conclusions: These findings indicate that trauma exposure prior to cocaine use influences cocaine relapse-related behavior, as well as non-drug reward earning, in a sex-specific manner. Overall, these results highlight the value of associative stress models such as TC for studying trauma–addiction comorbidity and the need to investigate the neurobiological mechanisms driving these sex-specific outcomes. Full article
(This article belongs to the Section Addiction Psychiatry)
22 pages, 876 KB  
Article
Large Autonomous Driving Overtaking Decision and Control System Based on Hierarchical Reinforcement Learning
by Chen-Ning Wang and Xiuhui Tang
Electronics 2026, 15(8), 1711; https://doi.org/10.3390/electronics15081711 - 17 Apr 2026
Abstract
To address the bottlenecks of low sample efficiency and poor control accuracy in traditional single-layer reinforcement learning during autonomous driving overtaking, this paper proposes an overtaking decision and control system based on hierarchical reinforcement learning to decouple complex tasks in spatial and temporal dimensions. A heterogeneous two-layer architecture is constructed, where the upper layer adopts the Proximal Policy Optimization algorithm to generate macroscopic discrete decisions, while the lower layer employs Twin Delayed Deep Deterministic Policy Gradient combined with Long Short-Term Memory to achieve smooth continuous control of steering and acceleration by perceiving temporal features of dynamic obstacles. A composite reward mechanism, integrating hard safety constraints and soft efficiency incentives, is designed to balance safety, efficiency, and comfort. Experimental results in complex scenarios with multiple interfering vehicles and random lane-changing behaviors demonstrate that the proposed system improves the training convergence speed by approximately 30% within 500,000 steps compared to single-layer algorithms. In tests across varying traffic densities, the system achieves a 98.3% success rate in medium-density scenarios with a collision rate of only 0.6%. In high-density challenges, the success rate remains above 95%, with the collision rate reduced by about 80% compared to baseline models. Furthermore, the lateral control deviation is strictly limited to within 0.2 m, and the longitudinal safety distance remains stable above 5 m. This system provides a robust, high-efficiency paradigm for autonomous overtaking. Full article
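A composite reward of the shape described, where a hard safety constraint dominates soft efficiency and comfort shaping, can be sketched as follows. The weights, penalty magnitude, and term definitions are illustrative assumptions, not the paper's values.

```python
# Illustrative composite reward: a hard safety constraint (large negative
# terminal penalty on collision) plus soft shaping terms for efficiency
# (speed tracking) and comfort (jerk). Weights are assumptions.

def composite_reward(collided, speed, target_speed, jerk,
                     w_eff=1.0, w_comfort=0.1, collision_penalty=-100.0):
    if collided:                      # hard constraint dominates everything
        return collision_penalty
    efficiency = -abs(speed - target_speed) / target_speed
    comfort = -abs(jerk)              # penalize abrupt acceleration changes
    return w_eff * efficiency + w_comfort * comfort

r_safe = composite_reward(collided=False, speed=25.0, target_speed=30.0, jerk=0.2)
r_crash = composite_reward(collided=True, speed=25.0, target_speed=30.0, jerk=0.2)
```

The design point is that no amount of efficiency shaping can offset the collision penalty, which is one common way to encode "hard" constraints inside a scalar reward.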
19 pages, 580 KB  
Article
Emergent Pedestrian Safety in a World-Model Driving Agent Under Adversarial Interaction Without Explicit Safety Rewards
by Stefan Zlatinov, Gorjan Nadzinski, Vesna Ojleska Latkoska, Dushko Stavrov and Mile Stankovski
Appl. Sci. 2026, 16(8), 3915; https://doi.org/10.3390/app16083915 - 17 Apr 2026
Abstract
Pedestrian interaction remains a central safety challenge for autonomous driving, particularly under non-compliant or adversarial pedestrian behavior. Existing research and evaluations predominantly test against rule-following pedestrians, leaving a gap in understanding how learning-based agents handle worst-case interactions. We introduce the Jaywalkers Library, a novel configurable benchmark in CARLA with three adversarial pedestrian archetypes (Intruder, Indecisive Crosser, and Protester). We evaluate a DreamerV3 agent trained with sparse rewards, where the only pedestrian-specific signal is a terminal collision penalty. Evaluation employs a frozen-policy protocol with explicit train–test separation. Safety behavior is decomposed into endpoint outcomes, evasion dynamics, and efficiency costs. Under nominal conditions, the agent achieves high route completion and generalizes to an unseen town, whereas under adversarial exposure, an archetype-sensitive evasion strategy emerges. The agent swerves at speed against dynamic pedestrians but decelerates against the slow-moving Protester. Collision rates reveal a counterintuitive difficulty ordering in which the Protester is the hardest, followed by the Intruder, with the Indecisive Crosser as the most survivable. These findings show that a sparse terminal penalty suffices for emergent pedestrian avoidance in a world-model agent, but that effectiveness is bounded by the world model’s ability to predict pedestrian persistence. Full article
(This article belongs to the Special Issue Advances in Virtual Reality and Vision for Driving Safety)
25 pages, 1876 KB  
Article
Ketogenic Diet Promotes Reward Learning by Upregulating Hippocampal CAMK2A Expression and Activating Dopamine Synaptic Signaling
by Yanan Qiao, Yubing Zeng, Chen Chen, Jinying Shen, Yi Wang, Pei Pei and Shan Wang
Int. J. Mol. Sci. 2026, 27(8), 3587; https://doi.org/10.3390/ijms27083587 - 17 Apr 2026
Abstract
Various neuromodulatory benefits of the ketogenic diet (KD) have been demonstrated, yet its influence on reward learning and the underlying mechanisms remain poorly defined. This study combined proteomics and metabolomics to identify key molecular changes in the hippocampus of KD-fed mice. Our analysis revealed significant upregulation of the “dopaminergic synapse” pathway, with CAMK2A emerging as a central regulator. In vitro, treatment of the hippocampal neuronal cell line HT22 with β-hydroxybutyrate (BHB), a primary KD metabolite, increased the protein expression of CAMK2A and the phosphorylation of its downstream target, GluA1. Crucially, Camk2a knockdown completely blocked the BHB-induced p-GluA1 enhancement. To determine the behavioral relevance, we stereotaxically delivered AAV-shCamk2a into the hippocampus of KD-fed mice. Knockdown of Camk2a reversed the pro-reward effects of KD, as measured by the sucrose preference test and conditioned place preference test, without impairing general locomotor activity in the open field test. Together, these results suggest a novel BHB–CAMK2A–dopaminergic signaling axis through which KD enhances reward learning, bridging systemic metabolism with cognitive function and expanding our understanding of KD-mediated neuromodulation. Full article
(This article belongs to the Section Bioactives and Nutraceuticals)
19 pages, 3326 KB  
Article
Energy-Harvesting-Assisted UAV Swarm Anti-Jamming Communication Based on Multi-Agent Reinforcement Learning
by Yongfang Li, Tianyu Zhao, Zhijuan Wu, Yan Lin and Yijin Zhang
Drones 2026, 10(4), 294; https://doi.org/10.3390/drones10040294 - 16 Apr 2026
Abstract
Considering that unmanned aerial vehicles (UAVs) are susceptible to both co-channel interference and malicious jamming while operating on limited onboard battery energy, this paper proposes an energy-harvesting-assisted anti-jamming communication framework for UAV swarm networks. Specifically, we first model the problem as a decentralized partially observable Markov decision process (Dec-POMDP), aiming to achieve a long-term trade-off between data transmission success rate and energy consumption. We then propose a multi-agent independent advantage actor–critic (IA2C)-based energy-harvesting-assisted anti-jamming communication solution, which enables each cluster head (CH) to learn its transmit channel, power, and energy harvesting time policy independently. By constructing a time-space-based extended Dec-POMDP, the spatiotemporal correlations among neighboring nodes are learned by allowing adjacent agents to share discounted local observations. Extensive simulations show that, compared with the benchmark schemes, the proposed scheme improves the average cumulative reward and average cumulative success rate by 17.26% and 10.37%, respectively, while achieving a higher transmission success rate with lower energy consumption under different numbers of available channels. Full article
(This article belongs to the Special Issue Intelligent Spectrum Management in UAV Communication)
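The long-term trade-off between transmission success and energy use can be made concrete with a minimal per-step reward sketch for one cluster head. The linear energy model, the weights, and the idea that harvested energy offsets transmit cost are assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch of a success/energy trade-off reward for one cluster head.
# Weights and the net-energy model are illustrative assumptions.

def ch_reward(packet_delivered, tx_energy, harvest_energy,
              w_success=1.0, w_energy=0.5):
    """Reward = success bonus minus net energy spent (harvesting offsets cost)."""
    net_energy = max(tx_energy - harvest_energy, 0.0)
    return w_success * float(packet_delivered) - w_energy * net_energy

r = ch_reward(packet_delivered=True, tx_energy=0.8, harvest_energy=0.3)
```

Summed over an episode with a discount factor, a reward of this shape is what each IA2C agent would maximize independently.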
25 pages, 5262 KB  
Article
A Novel and Optimal Reservoir Operation Model Incorporating Inflow Forecasts Based on Deep Reinforcement Learning Algorithms
by Xin Xiang, Shenglian Guo, Bokai Sun, Xiaoya Wang, Le Guo and Zhiming Liang
Water 2026, 18(8), 948; https://doi.org/10.3390/w18080948 - 16 Apr 2026
Abstract
Deep reinforcement learning (DRL) has been increasingly used in reservoir operation, but several key challenges and limitations need further study. This paper develops a novel optimal reservoir operation model incorporating inflow forecasts based on DRL and the deterministic policy gradient algorithm. A multi-dimensional reward function was derived from the objective functions and constraints, and an optimal scheduling scheme was established with dynamically weighted reward functions. The observed daily flow data and 5-day inflow forecasts of the Three Gorges Reservoir (TGR) during flood seasons (10 June to 31 October) from 2010 to 2025 were used to evaluate the model performance, and the results were compared with actual operation. The results show that, compared with the actual operation, Scheme-1 with dynamic weights increases the annual average flood prevention storage capacity by approximately 36.8%, enhances power generation by about 2.86 billion kW·h (≈5.49%), and reduces spilled water volume by around 3.33 billion m³. This study demonstrates that the optimal scheduling model can substantially improve the overall efficiency of reservoir operation, and the improvement is even more pronounced when the reward function weights are set dynamically. Full article
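The idea of dynamically weighted reward terms can be sketched as follows. The objective terms, thresholds, and weight schedule below are illustrative assumptions; the paper's actual reward function is multi-dimensional and derived from its own objectives and constraints.

```python
# Sketch of a dynamically weighted multi-objective reward for reservoir
# operation: flood-control, power, and spill terms combined with weights
# that shift with the forecast inflow. All names and values are assumptions.

def scheduling_reward(flood_margin, power_output, spill, forecast_inflow,
                      flood_threshold=50000.0):
    # Raise the flood-control weight when a large inflow is forecast.
    w_flood = 0.6 if forecast_inflow > flood_threshold else 0.2
    w_power = 1.0 - w_flood - 0.1   # remaining weight after a fixed spill term
    return w_flood * flood_margin + w_power * power_output - 0.1 * spill

r_wet = scheduling_reward(flood_margin=1.0, power_output=0.5, spill=0.2,
                          forecast_inflow=60000.0)
r_dry = scheduling_reward(flood_margin=1.0, power_output=0.5, spill=0.2,
                          forecast_inflow=10000.0)
```

The same state earns more reward for holding flood storage when high inflow is forecast, which is the mechanism by which forecast-conditioned weights steer the policy.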
26 pages, 1456 KB  
Article
Artificial Intelligence-Based Decision Support System for UAV Control in a Simulated Environment
by Przemysław Sujecki and Damian Frąszczak
Sensors 2026, 26(8), 2436; https://doi.org/10.3390/s26082436 - 15 Apr 2026
Abstract
Unmanned aerial vehicles (UAVs) are increasingly deployed in missions that require high autonomy and reliable decision-making; however, many operational concepts still assume access to GNSS and stable communication with a human operator. In contested environments, this assumption may no longer hold because GNSS degradation, radio-frequency interference, and intentional jamming can disrupt positioning and communication, thereby reducing mission effectiveness and safety. Recent surveys show that operation in GNSS-denied environments remains a major challenge and often requires alternative perception, localization, and control strategies. In response, this article investigates a reinforcement learning (RL)-based decision-support system for the autonomous control of a quadrotor UAV in a three-dimensional simulated environment. Rather than following pre-programmed waypoints, the UAV learns a control policy through interaction with the environment and reward-driven adaptation. The proposed system is designed for mission execution under uncertainty, limited external guidance, and partial observability. Two policy-gradient approaches are implemented and compared: classical REINFORCE and Proximal Policy Optimization (PPO) with an Actor–Critic architecture. The study presents the simulation environment, state and action representation, reward formulation, staged training procedure, and comparative evaluation. The results indicate that the PPO-based configuration achieved higher mission effectiveness than REINFORCE in the final unseen test scenario, supporting the practical relevance of structured deep reinforcement learning for UAV operation in GPS-denied and communication-constrained environments. Full article
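Both policy-gradient approaches compared here consume discounted episode returns. As a neutral illustration of that shared ingredient (the discount factor is an assumption, not the study's value), the returns-to-go over one reward trajectory can be computed as:

```python
# Discounted returns-to-go, the quantity REINFORCE weights its log-prob
# gradients by (PPO uses it via advantage estimates). Gamma is illustrative.

def returns_to_go(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

# Sparse terminal reward of 1.0 propagated back through a 3-step episode:
G = returns_to_go([0.0, 0.0, 1.0], gamma=0.5)
```

Early actions receive geometrically discounted credit for the terminal reward, which is why sparse-reward tasks like the one described train slowly for vanilla REINFORCE.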
30 pages, 1499 KB  
Article
Environment-Aware Optimal Placement and Dynamic Reconfiguration of Underwater Robotic Sonar Networks Using Deep Reinforcement Learning
by Qiming Sang, Yu Tian, Jin Zhang, Yuyang Xiao, Zhiduo Tan, Jiancheng Yu and Fumin Zhang
J. Mar. Sci. Eng. 2026, 14(8), 733; https://doi.org/10.3390/jmse14080733 - 15 Apr 2026
Abstract
Underwater dynamic target detection, classification, localization, and tracking (DCLT) is central to maritime surveillance and monitoring and increasingly relies on distributed AUV-based robotic sonar networks operating in passive listening and, when required, cooperative multistatic modes. Achieving a robust performance in realistic oceans remains challenging, because sensor placement must adapt to time-varying acoustic conditions and target priors while preserving acoustic communication connectivity, and because frequent reconfiguration under dynamic currents makes classical large-scale planning computationally expensive. This paper presents an integrated deep reinforcement learning (DRL)-based framework for passive-stage sonar placement and dynamic reconfiguration in distributed AUV networks. First, we cast placement as a constructive finite-horizon Markov decision process (MDP) and train a Proximal Policy Optimization (PPO) agent to sequentially build a collision-free layout on a discretized surveillance grid. The terminal reward is formulated to jointly optimize the environment-aware detection performance, computed from BELLHOP-based transmission loss models, and global network connectivity, quantified using algebraic connectivity. Second, to enable time-critical reconfiguration, we estimate flow-aware motion costs for all AUV–destination pairs using a PPO with a Long Short-Term Memory (LSTM) trajectory policy trained for partial observability. The learned policy can be deployed onboard, allowing each AUV to refine its path online using locally sensed currents, improving robustness to ocean-model uncertainty. The resulting cost matrix is solved via an efficient zero-element assignment method to obtain the optimal one-to-one reassignment. 
In the reported simulation studies, the proposed Sequential PPO placement method achieves a final reward 16–21% higher than Particle Swarm Optimization (PSO) and 2–3.7% higher than the Genetic Algorithm (GA), while the proposed PPO + LSTM planner reduces average travel time by 30.44% compared with A*. The proposed closed-loop architecture supports frequent re-optimization, scalable fleet operation, and a seamless transition to communication-supported cooperative multistatic tracking after detection, enabling efficient, adaptive DCLT in dynamic marine environments. Full article
(This article belongs to the Section Ocean Engineering)
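The final reassignment step maps AUVs to destinations one-to-one over a flow-aware cost matrix. The paper uses an efficient zero-element assignment method; as a stand-in for small fleets, a brute-force search over permutations (names and the example matrix are illustrative) shows what is being optimized:

```python
# Brute-force one-to-one assignment over a cost matrix: a stand-in for the
# paper's zero-element assignment method, fine only for small n.
from itertools import permutations

def best_assignment(cost):
    """Return (total_cost, mapping) minimizing sum of cost[i][perm[i]]."""
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

# Hypothetical flow-aware travel costs: rows = AUVs, columns = destinations.
cost = [[4.0, 1.0, 3.0],
        [2.0, 0.0, 5.0],
        [3.0, 2.0, 2.0]]
total, mapping = best_assignment(cost)
```

Note the optimum here sends no AUV to its individually cheapest destination in isolation; minimizing the joint cost is exactly why an assignment solver is needed rather than greedy per-AUV choices.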
27 pages, 1420 KB  
Article
Synergistic Governance of Pollution Reduction and Carbon Mitigation Through Air Quality Ecological Compensation: Evidence from China
by Zhuo Chen and Qingxuan Bu
Sustainability 2026, 18(8), 3909; https://doi.org/10.3390/su18083909 - 15 Apr 2026
Abstract
Atmospheric pollutants and CO2 share common origins in fossil fuel combustion, raising the question of whether fiscal incentives targeting air quality alone can indirectly reduce carbon emissions. This study examines this question by evaluating China’s air quality ecological compensation policy, a provincial-level horizontal fiscal transfer mechanism under which cities are rewarded or penalized according to changes in ambient air quality indicators, without incorporating any explicit carbon-related assessment criteria. Using panel data from 268 prefecture-level cities over 2007–2023 and a multi-period difference-in-differences design, we find that the policy significantly reduces the composite pollution carbon index (β = −0.213, p < 0.01), with the effect confirmed by an alternative weighted-average specification (β = −0.153, p < 0.01) and robust to propensity score matching, one-period lagged regression, exclusion of provincial-level municipalities, and exclusion of the COVID-19 period. A two-step mechanism analysis, adopted to avoid post-treatment bias from “bad controls,” reveals that the policy promotes industrial structure upgrading (β = 0.253, p < 0.01), enhances green technological innovation capacity (β = 0.047, p < 0.10), and reduces energy consumption intensity (β = −0.012, p < 0.01). Heterogeneity analysis based on quartile subsamples shows that the synergistic benefits concentrate in cities with stronger fiscal capacity (β = −0.349, p < 0.01 versus insignificant for low-support cities), higher economic development, and greater urbanization (β = −1.558, p < 0.01 for highly urbanized cities), while the policy effect is statistically insignificant in the least-advantaged subgroups across these three dimensions. 
In contrast, the green coverage dimension reveals an opposite pattern: the effect is strongest in cities with lower green coverage (β = −0.378, p < 0.05) and insignificant in high-coverage cities, indicating diminishing marginal returns where environmental baselines are already favorable. These findings highlight the need for differentiated compensation standards, including tiered compensation coefficients and targeted fiscal support for resource-constrained regions, to ensure equitable governance outcomes. Full article
22 pages, 1136 KB  
Article
Co-Optimized Scheduling of a Multi-Microgrid System Based on a Reputation Point Trading Mechanism
by Jiankai Fang, Dongmei Yan, Hongkun Wang, Hui Deng, Xinyu Meng and Hong Zhang
Smart Cities 2026, 9(4), 69; https://doi.org/10.3390/smartcities9040069 - 15 Apr 2026
Abstract
With the rapid integration of distributed energy resources, achieving a balance between economic efficiency and environmental sustainability in multi-microgrid (MMG) systems is critical. However, existing studies typically treat microgrid operators as fully compliant entities, neglecting the “trust-risk” dimension and potential default behaviors in decentralized markets. This paper proposes a novel co-optimized scheduling model for urban MMG systems, centered on a unified “Social–Economic–Physical” coupling framework. To ensure transaction integrity, a robust reputation evaluation framework is developed using root mean square error (RMSE), mean absolute error (MAE), and dynamic time warping (DTW); this framework effectively identifies fraudulent data and contractual breaches. Furthermore, to enhance fairness while promoting decarbonization, the model integrates a dynamic network pricing strategy based on the Shapley value, alongside a reputation-weighted reward–penalty step-type carbon trading scheme. The proposed model is formulated as a mixed-integer linear programming (MILP) problem and solved using MATLAB R2025b with CPLEX 12.10. Simulation results demonstrate that the integrated approach significantly improves system performance: total carbon emissions are reduced by 49.6 tons, while revenues for the MMG Alliance, individual microgrids, and shared energy storage operators increase by 4.08% to 33.00%. The proposed framework provides a practical governance solution for Smart City multi-microgrid systems, effectively addressing the “trust-risk” challenge in decentralized urban energy markets. The findings validate that the proposed mechanism fosters a trustworthy trading environment, achieving a “win-win” outcome for economic profitability and urban energy resilience. Full article
(This article belongs to the Section Smart Urban Energies and Integrated Systems)
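Of the three reputation metrics, DTW is the least standard, so a textbook sketch may help: it measures how well a microgrid's delivered power profile tracks its pledged profile even when the two are misaligned in time. How the paper thresholds or weights the three metrics is not specified here.

```python
# Textbook O(n*m) dynamic time warping between two power profiles, one
# possible ingredient of a reputation score (weighting is an assumption).

def dtw_distance(a, b):
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: insertion, deletion, match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

exact = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical: 0
warped = dtw_distance([1.0, 2.0, 3.0], [1.0, 1.5, 2.0, 3.0])
```

Unlike RMSE or MAE, DTW tolerates profiles of different lengths and small time shifts, charging only for amplitude deviations along the best alignment path.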
29 pages, 46316 KB  
Article
Adaptive Traffic Signal Control Using Deep Reinforcement Learning with Noise Injection
by Raul Alejandro Velasquez Ortiz, María Elena Lárraga Ramírez, Luis Agustín Alvarez-Icaza and Héctor Alonso Guzmán Gutiérrez
Appl. Sci. 2026, 16(8), 3833; https://doi.org/10.3390/app16083833 - 15 Apr 2026
Abstract
Adaptive traffic signal control (ATSC) remains a critical challenge for urban mobility. Deep reinforcement learning (DRL) has been widely investigated for ATSC, showing promising improvements in simulated environments. However, a noticeable gap remains between simulation-based results and practical implementations, due to reward formulations that do not address phase instability. Stochastic variations may trigger premature phase changes (“flickers”), affecting signal behavior and potentially limiting deployment in real scenarios. Although several works have examined delay, queues, and decentralized coordination, stability-focused variables remain comparatively less explored, particularly in single yet complex intersections. This study proposes a decentralized DRL model for ATSC with noise injection (ATSC-DRLNI) applied to a single intersection, introducing a stability-oriented reward function that integrates flickers, queue length, and advantage actor–critic (A2C) learning feedback. The model is evaluated in the Simulation of Urban MObility (SUMO) platform and compared against seven baseline methods, using real traffic data from a Mexican city for calibration and validation. Results suggest that penalizing flickers may contribute to more stable phase transitions, while reductions of up to 40% in queue length were observed in heavy-traffic scenarios. These findings indicate that incorporating stability-related variables into reward functions may help translate DRL-based ATSC studies into practical deployments. Full article
(This article belongs to the Section Transportation and Future Mobility)
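The stability-oriented reward described in the abstract could be sketched roughly as below. This is a minimal illustration, not the paper's formulation: the function name, the weights `w_queue` and `w_flicker`, and the `min_green` threshold are all assumptions; the paper additionally integrates A2C learning feedback, which is omitted here.

```python
def atsc_reward(queue_len, phase_changed, steps_since_change,
                min_green=10, w_queue=1.0, w_flicker=5.0):
    """Hypothetical stability-oriented reward: penalize queue length,
    and add an extra 'flicker' penalty when the signal phase switches
    before a minimum green time has elapsed."""
    reward = -w_queue * queue_len
    if phase_changed and steps_since_change < min_green:
        reward -= w_flicker  # premature phase change ("flicker")
    return reward
```

Under this shaping, an agent that switches phases every few steps accumulates flicker penalties even when queues are short, which is the mechanism the abstract credits for more stable phase transitions.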
33 pages, 85096 KB  
Article
Modeling Seismic Resilience and Hospital Evacuation: A Comparative Analysis of Multi-Agent Reinforcement Learning and Classical Evacuation Models
by Chunlin Bian, Yonghao Guo, Gang Meng, Liuyang Li, Hua Chen, Fuhong Lv and Xiaofeng Chai
Buildings 2026, 16(8), 1538; https://doi.org/10.3390/buildings16081538 - 14 Apr 2026
Abstract
Hospitals in earthquake-prone regions must evacuate heterogeneous occupants rapidly while preserving operational continuity under disrupted conditions. However, many hospital-evacuation studies still rely on static routing assumptions or narrowly defined behavioral rules, which limits their value for building-level resilience planning. This paper develops a comparative hospital-campus evacuation framework that combines GIS-based geodesic routing, heterogeneous agent-based modeling, and reinforcement-learning-based decision policies. Puge County People’s Hospital in Sichuan, China, is used as the case study. Six algorithms are evaluated: three rule-based baselines—Shortest Path (SP), Random Walk (RW), and the Social Force Model (SFM)—together with a training-free density-aware heuristic, Density-Aware Gradient Routing (DAGR), and two reinforcement-learning approaches, Density-Aware Q-Learning (DAQL) and SARSA. Experiments cover three population scales (N ∈ {50, 100, 200}), normal daytime conditions, staffing-variation scenarios, and a blocked-exit disruption scenario, with 30 independent runs for each main condition. The results show that the rule-based and training-free methods remain the most reliable under full multi-agent evaluation: the SFM and RW achieve the highest completion ratios (approximately 100% and 93.5%, respectively), while DAGR provides the strongest balance between completion and evacuation efficiency among the non-trained methods. In contrast, the trained RL agents perform substantially worse in direct multi-agent deployment, with DAQL reaching approximately 37% completion and SARSA approximately 17%, highlighting a train–evaluation distribution shift associated with independent Q-learning. The ablation analysis further shows that collision avoidance is the most critical reward component, whereas density-avoidance shaping can unintentionally induce collective deadlock when all agents execute the learned policy simultaneously. Among the enhanced variants, DAQL_RoleAware yields the best overall improvement, increasing the completion ratio to approximately 52% and reducing the 90th-percentile evacuation time to approximately 363 s. Overall, this paper clarifies both the promise and the present limitations of density-aware reinforcement learning for hospital evacuation while providing a more building-centred and reproducible basis for future coordination-aware evacuation design and emergency-planning research. Full article
(This article belongs to the Special Issue Innovative Solutions for Enhancing Seismic Resilience of Buildings)
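A training-free density-aware routing step of the kind DAGR names could look roughly like the sketch below. The function name, the `beta` weight, and the dictionary-based grid representation are all assumptions for illustration; the paper's actual heuristic operates on a GIS-based geodesic distance field.

```python
def dagr_step(pos, neighbors, dist_to_exit, density, beta=2.0):
    """Hypothetical density-aware gradient routing step: among the
    neighboring cells of the current position, pick the one minimizing
    (geodesic distance to exit) plus a weighted occupant-density penalty,
    so agents detour around congested corridors."""
    return min(neighbors[pos],
               key=lambda n: dist_to_exit[n] + beta * density[n])
```

With `beta = 0` this degenerates to plain shortest-path descent; raising `beta` trades route length for congestion avoidance, which is the completion-versus-efficiency balance the abstract attributes to DAGR.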
32 pages, 12012 KB  
Article
Multi-Agent Reinforcement Learning-Based Intelligent Game Guidance with Complex Constraint
by Fucong Liu, Yang Guo, Shaobo Wang, Jin Wang and Zhengquan Liu
Aerospace 2026, 13(4), 365; https://doi.org/10.3390/aerospace13040365 - 14 Apr 2026
Abstract
For the complex problems of multi-aircraft cooperative game guidance with No-Fly Zone (NFZ) avoidance and cross-task constraint propagation, a deep deterministic policy gradient algorithm with temporal awareness and priority cooperative optimization (TP-MADDPG) is proposed. Based on the three-body cooperative guidance, a new coupled guidance task is formed by adding the NFZ avoidance constraint. At the same time, considering the constraint compatibility problem in dynamic task switching, the cooperative aircraft are modeled as independent agents with differentiated policy networks. First, a nonlinear kinematic model of the three-body game constructed by Evader–Pursuer–Defender is established, and four complex constraint conditions, namely homing guidance, NFZ avoidance, collision avoidance, and cooperative guidance, are modeled separately. Second, a Long Short-Term Memory-based (LSTM) Actor–Critic framework is proposed to dynamically capture the evolution patterns of adversarial scenarios by mining hidden correlations in historical state-action sequences. This enables smooth policy transitions between the cooperative guidance phase and the subsequent homing guidance phase, effectively addressing the challenges of environmental non-stationarity and temporal task dependencies. Then, a priority-driven adaptive sampling mechanism is proposed along with a heterogeneous-role cooperative reward function to address credit assignment imbalance and sparse reward problems, respectively. The sampling mechanism capitalizes on the efficient retrieval properties of SumTree data structures while integrating bias correction techniques to expedite policy gradient convergence. The reward function uses reward shaping to formulate cooperative reward components that explicitly capture behavioral correlations among agents. Finally, simulations show that the proposed method significantly outperforms multi-agent reinforcement learning baselines, effectively improving the performance of cooperative game guidance under complex constraints. Full article
(This article belongs to the Special Issue Flight Guidance and Control)
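The SumTree structure the sampling mechanism relies on can be sketched minimally as below. This is an illustrative sketch of the generic data structure used in prioritized experience replay, not the paper's implementation; the class layout and method names are assumptions, and the bias-correction (importance-sampling) step the abstract mentions is omitted.

```python
import random

class SumTree:
    """Minimal sum-tree: each leaf holds one transition's priority,
    each internal node holds the sum of its children, so sampling a
    leaf with probability proportional to priority takes O(log n)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # nodes 1..2n-1; leaves at n..2n-1
        self.next = 0                        # next leaf slot to fill

    def add(self, priority):
        """Store a new transition's priority, overwriting the oldest."""
        self.update(self.next + self.capacity, priority)
        self.next = (self.next + 1) % self.capacity

    def update(self, idx, priority):
        """Set a leaf's priority and refresh partial sums up to the root."""
        self.tree[idx] = priority
        while idx > 1:
            idx //= 2
            self.tree[idx] = self.tree[2 * idx] + self.tree[2 * idx + 1]

    def sample(self):
        """Draw a transition index with probability ~ its priority."""
        s, idx = random.uniform(0.0, self.tree[1]), 1
        while idx < self.capacity:           # descend to a leaf
            left = 2 * idx
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx - self.capacity
```

Because the root always holds the total priority mass, drawing a uniform sample in `[0, total]` and descending the tree retrieves a transition in logarithmic time, which is the efficiency property the abstract exploits.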