Search Results (833)

Search Parameters:
Keywords = distributed reinforcement learning

39 pages, 4467 KB  
Review
Deep-Sea Biomimetic Manta Ray Robots: A Comprehensive Review Based on Operational Depth Spectrum, Structures, Energy Optimization, and Control Systems
by Lugang Ye, Hongyuan Liu, Qiulin Ding, Zhongming Hu, Weikun Li, Weicheng Cui and Dixia Fan
Biomimetics 2026, 11(3), 216; https://doi.org/10.3390/biomimetics11030216 - 18 Mar 2026
Abstract
As deep-sea exploration transitions from large-scale search to precision pinpoint operations, the inherent limitations of traditional “rigid-body and propeller” vehicles—specifically in low-speed maneuverability, environmental compliance, and acoustic stealth—are becoming increasingly apparent. Leveraging its unique integrated “gliding-flapping” locomotion and exceptional maneuverability, the manta ray serves as an ideal biological prototype for next-generation deep-sea operational platforms. From a systems engineering perspective, this paper provides a comprehensive review of the current research status and technical evolution of biomimetic manta ray submersibles. First, a technical pedigree centered on “operational depth” is established, illustrating how design paradigms transition from “mechanism replication” in shallow waters to “pressure adaptation” at full-ocean depths. Second, the mechanical challenges in structural design are explored, demonstrating that a “rigid-flexible” gradient distribution strategy is critical to resolving the conflict between pressure resistance and propulsive compliance. Regarding energy and propulsion, the synergistic effects of hybrid gliding-flapping drives and integrated structural batteries in enhancing long-range endurance and energy efficiency are analyzed. Finally, the evolution of motion control architectures—transitioning from spinal-cord-inspired Central Pattern Generator (CPG) rhythmic control to Deep Reinforcement Learning (DRL) featuring embodied intelligence—is outlined. Full article
(This article belongs to the Special Issue Bionics in Engineering Practice: Innovations and Applications)
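
For readers following the control-architecture thread, the CPG stage referred to above is conventionally modeled as a network of coupled phase oscillators; the equations below are the generic textbook formulation (intrinsic frequencies ν_i, coupling weights w_ij, phase biases φ_ij, amplitude setpoints R_i), not the controller of any specific vehicle surveyed.

```latex
% Generic phase-oscillator CPG (schematic); a DRL policy typically modulates
% the parameters \nu_i, R_i, \phi_{ij} rather than commanding joints directly.
\begin{aligned}
\dot{\theta}_i &= 2\pi\nu_i + \sum_{j} w_{ij}\,\sin\!\left(\theta_j - \theta_i - \phi_{ij}\right) \\
\ddot{r}_i &= a_i\!\left(\tfrac{a_i}{4}\,(R_i - r_i) - \dot{r}_i\right) \\
x_i &= r_i\,\bigl(1 + \cos\theta_i\bigr)
\end{aligned}
```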

35 pages, 1839 KB  
Article
Adversarially Robust Reinforcement Learning for Energy Management in Microgrids with Voltage Regulation Under Partial Observability
by Elida Domínguez, Xiaotian Zhou and Hao Liang
Energies 2026, 19(6), 1497; https://doi.org/10.3390/en19061497 - 17 Mar 2026
Abstract
Modern microgrids increasingly rely on learning-based energy management systems (EMSs) for real-time decision-making, yet remain vulnerable to cyber–physical disturbances, sensor tampering, and model uncertainty. Existing resilient control and robust reinforcement learning methods provide useful foundations, but rarely address adversarial measurement perturbations that distort belief evolution under partial observability. This gap is critical, as structured perturbations in sensing channels can destabilize learning-based policies and propagate into voltage-regulation violations. This paper proposes an adversarially robust reinforcement learning framework for energy management with voltage regulation under partial observability in microgrids. The EMS decision-making problem is formulated as a partially observable Markov decision process (POMDP) that accounts for adversarial measurement perturbations, belief evolution, and system-level economic and voltage constraints. To avoid excessive conservatism under worst-case uncertainty, an adversary-aware belief construction based on adversarial belief balancing (A3B) is employed to focus on policy-relevant perturbations. Building on this belief representation, an adversarially robust learning framework is developed by incorporating adversarial counterfactual error (ACoE) as a learning regularization mechanism, enabling a balance between nominal operating efficiency and robustness under adversarial measurement distortion. The case study is conducted on a medium-voltage radial distribution feeder (IEEE 123-Node Test Feeder). Case study results demonstrate that the proposed ACoE-regularized policies substantially reduce voltage-deficit events, improve policy stability, and maintain operational constraints under adversarial perturbations, consistently outperforming standard proximal policy optimization (PPO)-based controllers. These results indicate that counterfactual-aware, belief-based learning substantially enhances voltage quality and operational resilience in microgrids with high penetration of distributed energy resources. Full article
(This article belongs to the Special Issue Transforming Power Systems and Smart Grids with Deep Learning)
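
The abstract does not spell out the ACoE term, so the snippet below is only a schematic of the general pattern it describes: a clipped PPO policy loss augmented with a penalty on how much the action distribution shifts under an adversarial observation perturbation. The names `perturb` and `acoe_weight` are illustrative, and the policy is assumed to output a categorical action distribution.

```python
import torch
import torch.nn.functional as F

def acoe_regularized_policy_loss(policy, obs, actions, advantages, old_log_probs,
                                 perturb, clip_eps=0.2, acoe_weight=0.1):
    """Schematic PPO loss with a counterfactual-consistency penalty.

    `perturb(obs)` stands in for an adversarial measurement perturbation; the
    penalty discourages the policy from changing its action distribution when
    the observations are distorted.
    """
    # Standard clipped PPO surrogate on the nominal observations.
    dist = policy(obs)                      # assumed to return a Categorical
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)

    # Counterfactual term: KL between action distributions under nominal
    # and perturbed observations (a stand-in for the paper's ACoE idea).
    nominal_probs = dist.probs.detach()
    perturbed_probs = policy(perturb(obs)).probs
    counterfactual_error = F.kl_div(perturbed_probs.log(), nominal_probs,
                                    reduction="batchmean")

    return -surrogate.mean() + acoe_weight * counterfactual_error
```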

28 pages, 1600 KB  
Article
A Data-Driven Deep Reinforcement Learning Framework for Real-Time Economic Dispatch of Microgrids Under Renewable Uncertainty
by Biao Dong, Shijie Cui and Xiaohui Wang
Energies 2026, 19(6), 1481; https://doi.org/10.3390/en19061481 - 16 Mar 2026
Abstract
The real-time economic dispatch of microgrids (MGs) is challenged by the high penetration of renewable energy and the resulting source–load uncertainties. Conventional optimization-based scheduling methods rely heavily on accurate probabilistic models and often suffer from high computational burdens, which limits their real-time applicability. To address these challenges, a data-driven deep reinforcement learning (DRL) framework is proposed for real-time microgrid energy management. The MG dispatch problem is formulated as a Markov decision process (MDP), and a Deep Deterministic Policy Gradient (DDPG) algorithm is adopted to efficiently handle the high-dimensional continuous action space of distributed generators and energy storage systems (ESS). The system state incorporates renewable generation, load demand, electricity price, and ESS operational conditions, while the reward function is designed as the negative of the operational cost with penalty terms for constraint violations. A continuous-action policy network is developed to directly generate control commands without action discretization, enabling smooth and flexible scheduling. Simulation studies are conducted on an extended European low-voltage microgrid test system under both deterministic and stochastic operating scenarios. The proposed approach is compared with model-based methods (MPC and MINLP) and representative DRL algorithms (SAC and PPO). The results show that the proposed DDPG-based strategy achieves competitive economic performance, fast convergence, and good adaptability to different initial ESS conditions. In stochastic environments, the proposed method maintains operating costs close to the optimal MINLP reference while significantly reducing the online computational time. These findings demonstrate that the proposed framework provides an efficient and practical solution for the real-time economic dispatch of microgrids with high renewable penetration. Full article
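
The reward design stated in the abstract (negative operational cost plus constraint penalties) is easy to make concrete; the helper below is a minimal sketch with invented component names and an illustrative penalty weight.

```python
def dispatch_reward(fuel_cost, grid_import_cost, ess_degradation_cost,
                    power_balance_error, soc_violation, penalty_weight=100.0):
    """Schematic MG dispatch reward: negative operating cost minus penalties.

    All arguments are per-step quantities; the penalty terms push the agent
    toward feasible power balance and state-of-charge limits.
    """
    operating_cost = fuel_cost + grid_import_cost + ess_degradation_cost
    constraint_penalty = penalty_weight * (abs(power_balance_error) + max(soc_violation, 0.0))
    return -(operating_cost + constraint_penalty)
```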

43 pages, 6922 KB  
Article
Multi-Flow Hybrid Task Offloading Scheme for Multimodal High-Load V2I Services
by Weiqi Luo, Yaqi Hu, Maoqiang Wu, Yijie Zhou, Rong Yu and Junbin Qin
Electronics 2026, 15(6), 1229; https://doi.org/10.3390/electronics15061229 - 16 Mar 2026
Abstract
In the Internet of Vehicles (IoV), connected vehicles generate high-load perception tasks with large-scale and multimodal sensitive data, imposing strict requirements on latency, computing, and privacy. Existing solutions still suffer from high task service latency and privacy risks. To address these issues, this paper proposes an integrated framework that jointly considers multi-flow task offloading, adaptive privacy preservation, and latency-aware resource incentive mechanism. Specifically, we propose a Location-Aware and Trust-based (LA-Trust) dual-node task offloading algorithm based on deep reinforcement learning (DRL), which treats pre-partitioned subtasks as multiple parallel flows and enables flow-level collaborative offloading optimization across neighboring nodes, allows subtask data uploading and processing to proceed concurrently, and incorporates node security into decision making. To further enhance privacy protection, a Distribution-Aware Local Differential Privacy (DA-LDP) algorithm is designed to adaptively inject artificial noise according to data heterogeneity, balancing privacy protection and task execution accuracy. In addition, a Delay-Cost Reverse Auction (DC-RA) algorithm is proposed to further reduce latency by introducing wireless channel modeling between idle vehicles and edge nodes into the incentive mechanism. Experimental results show that the proposed framework improves task execution accuracy by 38% and reduces offloading cost, delay, incentive cost, and auction communication latency by 64.41%, 64.64%, 19%, and 44%, respectively, while more than 60% of tasks are offloaded to high-trust nodes. Full article
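
As background for the DA-LDP step, local differential privacy is normally achieved by adding calibrated noise on the vehicle before data leaves it. The sketch below uses a plain Laplace mechanism with a per-feature sensitivity estimate standing in for the paper's distribution-aware scaling, which the abstract does not detail.

```python
import numpy as np

def ldp_perturb(features: np.ndarray, epsilon: float) -> np.ndarray:
    """Laplace-mechanism perturbation applied locally before offloading.

    Sensitivity is approximated per feature from the observed value range; this
    stands in for the distribution-aware scaling described in the paper.
    """
    sensitivity = features.max(axis=0) - features.min(axis=0) + 1e-9
    scale = sensitivity / epsilon          # larger epsilon -> less noise
    noise = np.random.laplace(loc=0.0, scale=scale, size=features.shape)
    return features + noise
```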

24 pages, 1253 KB  
Article
A Reinforcement Learning-Based Framework for Tariff-Aware Load Shifting in Energy-Intensive Manufacturing
by Jersson X. Leon-Medina, Mario Eduardo González Niño, Claudia Patricia Siachoque Celys, Bernardo Umbarila Suarez and Francesc Pozo
Sensors 2026, 26(6), 1858; https://doi.org/10.3390/s26061858 - 15 Mar 2026
Abstract
Optimizing energy-intensive manufacturing under time-varying electricity tariffs requires scheduling strategies that reduce cost without compromising operational feasibility. This study is grounded in readily available industrial sensing: we exclusively use time-series measurements of aggregated active power and energy at the main distribution board of a quicklime production plant. We propose a tariff-aware load-shifting framework in which a Proximal Policy Optimization (PPO) reinforcement learning agent is trained in a custom Gymnasium environment to apply discrete consumption scaling actions constrained to 80–125% of a baseline profile during the operating shift (08:00–16:00), explicitly accounting for demand-charge exposure in the TOU peak window (13:00–15:00). The reward design combines instantaneous electricity cost with cumulative energy-tracking penalties and terms associated with operational constraints. Multi-day validation over N=30 working days shows consistent economic benefits, with a median total cost reduction on the order of 10% (narrow IQR) driven by reduced peak-window energy and demand peaks. However, the script-based binary compliance indicators (viol_energy, viol_prod_min) reveal deviations from the energy-balance criterion and occasional minimum-production shortfalls under the tolerances used, highlighting the cost–production trade-off and the need for stricter constraint handling for industrial deployment. In addition, we benchmark against dynamic programming (DP), an alternative RL policy (DQN), and a greedy heuristic (GREEDY), comparing cost; operational performance; and, when applicable, computational efficiency, which positions PPO as a competitive alternative among the considered methods. Overall, this work demonstrates how learning-based decision making can be coupled with real-world industrial sensing infrastructures, providing a data-driven tariff-aware scheduling layer for industrial energy management under practical constraints. Full article
(This article belongs to the Special Issue AI-Driven Analytics and Intelligent Sensing for Industrial Systems)
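
A minimal Gymnasium environment skeleton makes the described action and reward structure concrete: discrete scaling actions bounded to 80–125% of the baseline over the 08:00–16:00 shift, with a cost term and an end-of-shift energy-tracking penalty. Tariff values, penalty weights, and the observation layout are placeholders, not the plant's actual parameters.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class LoadShiftEnv(gym.Env):
    """Schematic tariff-aware load-shifting environment (hourly steps, 08:00-16:00)."""

    SCALES = np.array([0.80, 0.90, 1.00, 1.10, 1.25])   # discrete scaling actions

    def __init__(self, baseline_kw, tariff_per_kwh, peak_hours=(13, 15), penalty=5.0):
        self.baseline_kw = np.asarray(baseline_kw)       # baseline profile, one value per hour
        self.tariff = np.asarray(tariff_per_kwh)         # TOU price per hour
        self.peak_hours = peak_hours
        self.penalty = penalty
        self.action_space = spaces.Discrete(len(self.SCALES))
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(3,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.hour = 8
        self.shifted_energy = 0.0                        # cumulative deviation from baseline
        return self._obs(), {}

    def step(self, action):
        scale = self.SCALES[action]
        power = self.baseline_kw[self.hour] * scale
        cost = power * self.tariff[self.hour]
        self.shifted_energy += power - self.baseline_kw[self.hour]
        self.hour += 1
        terminated = self.hour >= 16
        # Instantaneous cost plus an end-of-shift penalty for missing total energy.
        reward = -cost
        if terminated:
            reward -= self.penalty * abs(self.shifted_energy)
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        in_peak = float(self.peak_hours[0] <= self.hour < self.peak_hours[1])
        return np.array([self.hour, in_peak, self.shifted_energy], dtype=np.float32)
```
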
26 pages, 4680 KB  
Article
Energy-Efficient Access Point Switch On/Off in Cell-Free Massive MIMO Using Proximal Policy Optimization
by Guillermo García-Barrios, Alberto Alonso and Manuel Fuentes
Electronics 2026, 15(6), 1219; https://doi.org/10.3390/electronics15061219 - 14 Mar 2026
Abstract
The increasing densification of cell-free massive multiple-input multiple-output (MIMO) networks makes access point switch on/off (ASO) a key mechanism for improving energy efficiency in future wireless systems. While reinforcement learning (RL) has been explored for ASO, differences in modeling assumptions and evaluation scope leave open questions regarding robustness and scalability. In this work, ASO is investigated from an explicit energy-efficiency perspective using a RL framework based on Proximal Policy Optimization (PPO). The policy learns state-dependent AP activation under partial observability using compact per-access point (AP) large-scale fading statistics and power parameters, without requiring instantaneous small-scale channel state information or combinatorial search, enabling practical online implementation. A comprehensive evaluation is conducted under a unified and reproducible simulation framework across three cell-free deployment scenarios of increasing size that preserve AP density while incorporating realistic channel and power consumption models. Performance is assessed through both average and distribution-based metrics. Numerical results show that the PPO-based policy consistently outperforms random activation and the all-on baseline, achieving energy-efficiency improvements of up to 66% and nearly 50%, respectively, while activating a comparable number of APs. Moreover, the learned policy maintains robust performance as the network scales, reducing the likelihood of highly energy-inefficient operating regimes. Full article
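
For context on the objective, the energy efficiency being maximized is essentially aggregate throughput divided by the power drawn under a given AP activation pattern; the helper below is a generic illustration with a made-up per-AP power model.

```python
import numpy as np

def energy_efficiency(rates_bps, ap_active, p_fixed_w=6.0, p_tx_w=1.0, fronthaul_w=0.8):
    """Bits-per-joule energy efficiency for one AP activation pattern.

    rates_bps : achievable per-user rates under this activation (bit/s)
    ap_active : boolean mask of switched-on access points
    The power model (fixed + transmit + fronthaul per active AP) is illustrative only.
    """
    total_rate = np.sum(rates_bps)
    total_power = np.sum(ap_active) * (p_fixed_w + p_tx_w + fronthaul_w)
    return total_rate / max(total_power, 1e-9)
```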

20 pages, 3015 KB  
Article
A Comprehensive Cost Estimation Model for Energy-Efficient and Reliable Operation of Rainwater Pumping Stations
by Jin-Gul Joo, In-Seon Jeong, Jin-Ho You, Seungwan Han and Seung-Ho Kang
Water 2026, 18(6), 676; https://doi.org/10.3390/w18060676 - 13 Mar 2026
Abstract
The increasing frequency of torrential rainfall due to global warming has resulted in a significant rise in urban flooding and river overflows. Rainwater pumping stations, typically located near rivers, serve as buffers between sewer systems and receiving water bodies, helping to mitigate flood risks. A primary challenge in operating these stations is optimizing pump performance to prevent flooding while minimizing energy consumption and costs. Various computational methods, including meta-heuristics and deep learning, have been proposed to tackle this optimization problem. However, most studies either overlook or inadequately address pump maintenance costs, which are essential for long-term operational efficiency. This gap stems from the lack of a comprehensive model that accurately captures the full spectrum of costs involved in pump operation. This paper introduces a cost estimation model that integrates both deterministic and probabilistic elements to enhance the energy-efficient operation of rainwater pumping stations. The model focuses on pumps with capacities of 100 m3/min and 170 m3/min, which are commonly used. It takes into account electricity consumption costs as well as maintenance costs arising from frequent on/off cycles and dry-run events. Predictions of failures due to these operational stresses are modeled using the Crow–AMSAA non-homogeneous Poisson process (NHPP) and Weibull distributions—probabilistic models widely used in mechanical failure analysis. To evaluate the proposed model, simulations were conducted using the Storm Water Management Model (SWMM), comparing a deep reinforcement learning-based control strategy with the current operational method at the Gasan Pumping Station in Seoul, South Korea. The pump operating costs associated with each method were calculated and analyzed using the proposed model, demonstrating its potential for ensuring cost-effective and reliable pump operation. Full article
(This article belongs to the Section Urban Water Management)
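
The two probabilistic building blocks named above have compact standard forms, reproduced here for reference (textbook notation, not the paper's fitted parameters; the two shape parameters written β are independent of each other).

```latex
% Crow--AMSAA (NHPP): expected cumulative failures and failure intensity up to time t.
N(t) = \lambda t^{\beta}, \qquad \rho(t) = \frac{dN(t)}{dt} = \lambda \beta\, t^{\beta - 1}
% Two-parameter Weibull: reliability and failure density with scale \eta and shape \beta.
R(t) = \exp\!\left[-\left(\tfrac{t}{\eta}\right)^{\beta}\right], \qquad
f(t) = \frac{\beta}{\eta}\left(\tfrac{t}{\eta}\right)^{\beta-1}
       \exp\!\left[-\left(\tfrac{t}{\eta}\right)^{\beta}\right]
```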

24 pages, 4049 KB  
Article
Resilience Assessment of Traditional Villages Based on Cultural Ecosystem Services—An Empirical Study of the Zuojiang Huashan Rock Art World Heritage Area in China
by Yong Lu, Liyana Hasnan and Bor Tsong Teh
Sustainability 2026, 18(6), 2845; https://doi.org/10.3390/su18062845 - 13 Mar 2026
Abstract
In this study, we explore how to balance the preservation of the original appearance of ancient villages with their development within the framework of World Heritage protection. We applied resilience theory and constructed a simple checklist, taking cultural ecosystem services into consideration, and selected the Zuojiang Huashan Rock Art Heritage Area in China for field investigation, as well as conducted in-depth interviews, the distribution of short questionnaires, and two rounds of Delphi surveys. This comprehensive approach enabled us to discover the key cultural ecosystem services that villagers rely on for their livelihoods. Then, we tracked how these services enhanced buffering capacity, helped people self-organize, and promoted adaptive learning. The results show that cultural ecosystem services constitute the core framework of the social–ecological resilience of the villages. The quantity and combination of the services directly determine the resilience score, and the resilience of villages within the heritage area shows significant spatial differentiation. High-resilience villages have diverse and mutually reinforcing cultural ecosystem services and local community rules, while low-resilience villages face service loss, weakened social connections, and single development options. Through this study, we aim to further enrich the cultural connotation of resilience theory, provide a practical assessment tool for practitioners of the method, and offer practical guidance and suggestions for transforming heritage protection from static protection to a dynamic, vibrant system that promotes vitality and resilience in practice. Full article
(This article belongs to the Section Tourism, Culture, and Heritage)

21 pages, 5844 KB  
Article
A Rule-Guided Distributional Soft Actor–Critic Algorithm for Safe Lane-Changing in Complex Driving Scenarios
by Shuwan Cui, Hao Li, Yanzhao Su, Jin Huang, Kun Cheng and Huiqian Li
Vehicles 2026, 8(3), 58; https://doi.org/10.3390/vehicles8030058 - 13 Mar 2026
Abstract
Mandatory lane-changing in complex driving scenarios poses significant challenges for autonomous driving systems due to complex vehicle interactions and strict safety requirements. Existing methods often rely on handcrafted rules or extensive expert demonstrations, which increase data collection costs and provide limited safety guarantees during learning. To address these issues, this paper proposes a rule-guided reinforcement learning framework for lane-changing policy optimization. A lightweight rule-based controller is employed to generate initial experience, guiding the training of an improved Distributional Soft Actor–Critic with Three Refinements (DSAC-T), while a safety-aware constraint controller filters high-risk actions to ensure stable and safe learning. The proposed method is evaluated in Regular Lane Change and Lane Merging scenarios under mixed traffic composed of aggressive and conservative vehicles within a simulation environment. Simulation results show that although lane-changing success rates decrease as traffic aggressiveness increases, the proposed method consistently outperforms SAC and TD3. Notably, under highly aggressive traffic conditions with an aggressiveness ratio of 0.7, the proposed approach improves the success rate by 17.13% compared to SAC and by 10.49% compared to TD3, demonstrating superior robustness and safety in complex, high-conflict lane-changing scenarios. The present study is conducted solely in simulation and requires further validation before application to real-world traffic environments. Full article
(This article belongs to the Special Issue AI-Empowered Assisted and Autonomous Driving)
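
The safety-aware constraint controller described above acts as a runtime filter between the learned policy and the vehicle. The sketch below renders that pattern with a time-to-collision check chosen purely for illustration; the paper's actual risk criterion is not given in the abstract.

```python
def safe_action(policy_action, ttc_front, ttc_target_lane, fallback_action,
                ttc_threshold=2.0):
    """Filter a lane-change action proposed by the RL policy.

    If the time-to-collision in the current or target lane falls below the
    threshold, the high-risk action is replaced by a conservative fallback
    (e.g., stay in lane and track the leader).
    """
    if min(ttc_front, ttc_target_lane) < ttc_threshold:
        return fallback_action      # override: abort the lane change
    return policy_action            # otherwise let the learned policy act
```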

29 pages, 23079 KB  
Article
Reinforced Arctic Puffin Optimization: A Multi-Strategy Fusion Approach with a Case Study in Manipulator Trajectory Planning
by Qi Xie, Mingyang Yu, Yongxiang Li, Guanzheng Jiang and Qiaoling Du
Electronics 2026, 15(6), 1186; https://doi.org/10.3390/electronics15061186 - 12 Mar 2026
Abstract
In agricultural automation, trajectory planning for fruit-picking robot arms must satisfy dynamic obstacle avoidance and real-time control constraints in complex orchards, forming a high-dimensional, constrained optimization problem. Due to strong nonlinearity and steep gradients, traditional planners often yield high-cost trajectories with unstable quality. This paper introduces a Reinforced Arctic Puffin Optimization (RAPO) algorithm for trajectory planning in high-dimensional, complex, constrained scenarios. RAPO improves Arctic Puffin Optimization (APO), which uses a two-stage foraging strategy but may suffer premature convergence, insufficient population diversity, and weak boundary handling. Dynamic fitness–distance balance (DFDB) adaptively coordinates exploration and exploitation. An elite-pool dynamic search strategy (DEPSS) combines t-distribution perturbation and Lévy flight to maintain diversity and enhance exploitation. A convex-lens opposition-learning boundary control method (CLOBC) improves out-of-bounds handling and reduces invalid search. Stochastic centroid opposition learning (SOBL) further suppresses premature convergence and expands coverage. On the CEC2017 benchmark (30/50/100 dimensions), RAPO outperforms nine algorithms in convergence speed and solution quality, verified by Wilcoxon and Friedman tests. In dense, narrow, and dynamic obstacle scenarios, RAPO achieves the lowest path cost, converges within 30 iterations, reduces variance, and generates smoother trajectories. This case study demonstrates RAPO’s robust mathematical performance, providing a robust and efficient framework for agricultural picking robots. Full article
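
Of the listed strategies, the Lévy-flight perturbation is the most self-contained. The Mantegna construction below is the standard way such heavy-tailed steps are generated and is shown as general background, not as the authors' exact operator.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    """Mantegna's algorithm for heavy-tailed Levy-flight steps (stability index beta)."""
    rng = rng or np.random.default_rng()
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2)
               / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size=dim)
    v = rng.normal(0.0, 1.0, size=dim)
    return u / np.abs(v) ** (1 / beta)

# Example use: perturb an elite solution with a small Levy-distributed jump.
# new_position = elite + 0.01 * levy_step(elite.size) * (elite - current)
```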

22 pages, 1506 KB  
Article
Task Offloading Based on Virtual Network Embedding in Software-Defined Edge Networks: A Deep Reinforcement Learning Approach
by Lixin Ma, Peiying Zhang and Ning Chen
Information 2026, 17(3), 278; https://doi.org/10.3390/info17030278 - 10 Mar 2026
Abstract
The advent of 5G/6G technologies and the pervasive deployment of IoT devices are driving the emergence of demanding applications that necessitate ultra-low latency, high bandwidth, and significant computational power. Traditional cloud computing models fall short in meeting these stringent requirements. To address this, Software-Defined Edge Networks (SDENs) have emerged as a promising architecture, yet efficiently managing their heterogeneous and geographically distributed resources poses substantial challenges for optimal application provisioning. In response, this paper proposes a novel framework for intelligent task offloading, which reframes the intricate multi-component application task offloading problem as a Virtual Network Embedding (VNE) challenge within a SDEN environment. We introduce a comprehensive model where complex applications are represented as Virtual Network Requests (VNRs). In this model, each VNR consists of virtual nodes that demand specific computing and storage resources, as well as virtual links that demand specific bandwidth and must adhere to maximum tolerable delay constraints. To dynamically solve this NP-hard VNE problem in the face of stochastic VNR arrivals and dynamic network conditions, we leverage Deep Reinforcement Learning (DRL). Specifically, a Soft Actor-Critic (SAC) agent is employed at the SDN controller. This agent learns a sequential decision-making policy for mapping virtual nodes to physical edge servers and virtual links to network paths. To guide the agent towards efficient resource utilization, we define the reward for each successful embedding as the long-term revenue-to-cost ratio. By learning to maximize this reward, the agent is naturally driven to find economically viable allocation strategies. Comprehensive simulation experiments demonstrate that our SAC-based VNE approach significantly outperforms other baselines across key metrics, affirming its efficacy in dynamic SDEN environments. Full article
(This article belongs to the Section Information and Communications Technology)
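
The revenue-to-cost reward follows the usual VNE accounting: revenue counts the resources a request asks for, cost counts what the substrate spends, with link bandwidth weighted by the embedded path length. The helper below uses that convention; the field names are illustrative.

```python
def revenue_to_cost(vnr_nodes, vnr_links, embedded_paths):
    """Revenue/cost ratio for one embedded Virtual Network Request.

    vnr_nodes      : list of dicts with requested 'cpu' and 'storage'
    vnr_links      : list of dicts with requested 'bandwidth'
    embedded_paths : hop count of the substrate path chosen for each virtual link
    """
    revenue = (sum(n["cpu"] + n["storage"] for n in vnr_nodes)
               + sum(l["bandwidth"] for l in vnr_links))
    cost = (sum(n["cpu"] + n["storage"] for n in vnr_nodes)
            + sum(l["bandwidth"] * hops for l, hops in zip(vnr_links, embedded_paths)))
    return revenue / max(cost, 1e-9)
```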

25 pages, 747 KB  
Article
Infection Aware Hyper-Heuristic Framework for Hospital Room–Patient Matching
by Kassem Danach, Wael Hosny Fouad Aly and Chadi Fouad Riman
Algorithms 2026, 19(3), 205; https://doi.org/10.3390/a19030205 - 9 Mar 2026
Abstract
The assignment of hospital rooms to patients is a critical operational decision that has a direct impact on patient safety, infection control, and staff workload. This study introduces HRPM–IRC, an epidemiology-aware hyper-heuristic framework developed to optimize room–patient matching by minimizing the risk of nosocomial infections, reducing travel and specialty mismatch costs, and promoting equitable nurse workload distribution. A mixed-integer linear programming model is formulated to capture infection transmission probabilities, isolation and cohorting requirements, and multi-ward capacity constraints. On top of this model, a bio-inspired hyper-heuristic adaptively selects and refines low-level heuristics, including cohort-first greedy allocation, risk-gradient swaps, and pathogen-aware local MILP refinement, on the basis of contextual epidemiological indicators and reinforcement learning. The framework was validated using a real-world dataset obtained from a tertiary hospital in Lebanon, comprising 142 anonymized patient admissions, 35 rooms, and six nursing teams. Results demonstrate that HRPM–IRC consistently reduces modeled infection risk and workload imbalance by up to forty percent compared to conventional assignment heuristics while maintaining near-real-time decision-making capabilities suitable for dynamic hospital operations. These findings underscore the effectiveness of epidemiology-aware hyper-heuristics in enhancing hospital resilience, improving infection prevention, and supporting fair resource utilization in data-limited healthcare environments typical of Lebanon and other middle-income countries. Full article
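
Of the low-level heuristics listed, cohort-first greedy allocation is the simplest to convey in code: infectious patients are placed first, preferring rooms already cohorting the same pathogen, and everyone else goes to the cheapest pathogen-free room with space. This is a generic rendering of the idea, not the HRPM–IRC implementation, and it assumes total room capacity suffices.

```python
from collections import defaultdict

def cohort_first_greedy(patients, rooms, cost):
    """Assign patients to rooms, cohorting same-pathogen patients first.

    patients : list of dicts with 'id' and 'pathogen' (None if not infectious)
    rooms    : list of dicts with 'id' and 'capacity'
    cost     : cost(patient, room) -> float, lower is better
    """
    assignment = {}
    load = defaultdict(int)
    room_pathogen = {}            # pathogen currently cohorted in each room, if any

    def free_rooms(pathogen):
        ok = []
        for r in rooms:
            if load[r["id"]] >= r["capacity"]:
                continue
            held = room_pathogen.get(r["id"])
            if held is None or held == pathogen:    # never mix different pathogens
                ok.append(r)
        return ok

    # Pass 1: infectious patients; rooms already holding the same pathogen are preferred.
    for p in (p for p in patients if p["pathogen"]):
        candidates = free_rooms(p["pathogen"])
        best = min(candidates,
                   key=lambda r: (room_pathogen.get(r["id"]) != p["pathogen"], cost(p, r)))
        assignment[p["id"]] = best["id"]
        load[best["id"]] += 1
        room_pathogen[best["id"]] = p["pathogen"]

    # Pass 2: non-infectious patients go to the cheapest pathogen-free room with space.
    for p in (p for p in patients if not p["pathogen"]):
        best = min(free_rooms(None), key=lambda r: cost(p, r))
        assignment[p["id"]] = best["id"]
        load[best["id"]] += 1
    return assignment
```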

39 pages, 67440 KB  
Article
LLM-TOC: LLM-Driven Theory-of-Mind Adversarial Curriculum for Multi-Agent Generalization
by Chenxu Wang, Jiang Yuan, Tianqi Yu, Xinyue Jiang, Liuyu Xiang, Junge Zhang and Zhaofeng He
Mathematics 2026, 14(5), 915; https://doi.org/10.3390/math14050915 - 8 Mar 2026
Abstract
Zero-shot generalization to out-of-distribution (OOD) teammates and opponents in multi-agent systems (MASs) remains a fundamental challenge for general-purpose AI, especially in open-ended interaction scenarios. Existing multi-agent reinforcement learning (MARL) paradigms, such as self-play and population-based training, often collapse to a limited subset of Nash equilibria, leaving agents brittle when faced with semantically diverse, unseen behaviors. Recent approaches that invoke Large Language Models (LLMs) at run time can improve adaptability but introduce substantial latency and can become less reliable as task horizons grow; in contrast, LLM-assisted reward-shaping methods remain constrained by the inefficiency of the inner reinforcement-learning loop. To address these limitations, we propose LLM-TOC (LLM-Driven Theory-of-Mind Adversarial Curriculum), which casts generalization as a bi-level Stackelberg game: in the inner loop, a MARL agent (the follower) minimizes regret against a fixed population, while in the outer loop, an LLM serves as a semantic oracle that generates executable adversarial or cooperative strategies in a Turing-complete code space to maximize the agent’s regret. To cope with the absence of gradients in discrete code generation, we introduce Gradient Saliency Feedback, which transforms pixel-level value fluctuations into semantically meaningful causal cues to steer the LLM toward targeted strategy synthesis. We further provide motivating theoretical analysis via the PAC-Bayes framework, showing that LLM-TOC converges at rate O(1/K) and yields a tighter generalization error bound than parameter-space exploration under reasonable preconditions. Experiments on the Melting Pot benchmark demonstrate that, with expected cumulative collective return as the core zero-shot generalization metric, LLM-TOC consistently outperforms self-play baselines (IPPO and MAPPO) and the LLM-inference method Hypothetical Minds across all held-out test scenarios, reaching 75% to 85% of the upper-bound performance of Oracle PPO. Meanwhile, with the number of RL environment interaction steps to reach the target relative performance as the core efficiency metric, our framework reduces the total training computational cost by more than 60% compared with mainstream baselines. Full article
(This article belongs to the Special Issue Applications of Intelligent Game and Reinforcement Learning)
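
Stripped of implementation detail, the bi-level objective can be written compactly. The notation below is a generic Stackelberg/regret formulation consistent with the abstract (J(π, μ) is the expected return of policy π against co-player population μ), not the paper's exact statement.

```latex
% Follower (MARL agent) minimizes regret against a fixed population \mu;
% leader (LLM) synthesizes \mu from a code space \mathcal{M}_{\mathrm{LLM}} to maximize that regret.
\operatorname{Regret}(\pi,\mu) = \max_{\pi'} J(\pi',\mu) - J(\pi,\mu),
\qquad
\max_{\mu \in \mathcal{M}_{\mathrm{LLM}}}\;\min_{\pi}\;\operatorname{Regret}(\pi,\mu)
```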

27 pages, 2344 KB  
Article
Cloud-Edge Resource Scheduling and Offloading Optimization Based on Deep Reinforcement Learning
by Lili Yin, Yunze Xie, Ze Zhao and Jie Gao
Sensors 2026, 26(5), 1704; https://doi.org/10.3390/s26051704 - 8 Mar 2026
Abstract
In the context of smart manufacturing, with the widespread deployment of Industrial Internet of Things (IoT) devices, a large number of computation tasks that are highly sensitive to latency and have strict deadlines have emerged, requiring real-time processing. Effectively offloading tasks to address the issues of increased latency and task dropouts caused by dynamic changes in edge node load has become a key challenge in the cloud–edge–end collaborative environment of smart manufacturing. To tackle the complex issues of unknown edge node loads and dynamic system state changes, this paper proposes a distributed algorithm based on deep reinforcement learning, utilizing convolutional neural networks (CNN) and the Informer architecture. The proposed algorithm leverages CNN to extract local features of edge node loads while utilizing Informer’s self-attention mechanism to capture long-term load variation trends, thereby effectively handling the uncertainty and dynamics inherent in node loads. Furthermore, by integrating the Dueling Deep Q-Network (DQN) and Double DQN techniques, the algorithm achieves a precise approximation of the state–action value function, further enhancing its capability to perceive system temporal characteristics and adapt to heterogeneous tasks. Each mobile device can independently make task offloading decisions and scheduling strategies based on its observations, enabling dynamic task allocation and optimization of execution order. Simulation results show that, compared to various existing algorithms, the proposed method reduces task dropout rates by 82.3–94% and average latency by 28–39.2%. Experimental results validate the significant advantages of this method in intelligent manufacturing scenarios with high load and latency-sensitive tasks. Full article
(This article belongs to the Section Internet of Things)
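
For context on the Dueling/Double DQN components, the dueling head decomposes the action value into a state value plus a mean-centered advantage, and Double DQN decouples action selection from evaluation. The PyTorch module below is a generic rendering of those two ideas; the paper's CNN/Informer feature extractor is not reproduced.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling Q-value head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, feature_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                      # (batch, 1)
        a = self.advantage(features)                  # (batch, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)    # identifiable Q-values

# Double DQN target (schematic): actions chosen by the online network,
# evaluated by the target network.
# next_q = target_head(next_feat).gather(1, online_head(next_feat).argmax(1, keepdim=True))
# td_target = reward + gamma * (1 - done) * next_q.squeeze(1)
```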

20 pages, 8998 KB  
Article
Satellite Resource Allocation Strategy for the Combined Scenario of Unmanned Terminals and Mobile Users
by Cong Huo, Qiaoli Yang, Peng Li and Liu Liu
Electronics 2026, 15(5), 1107; https://doi.org/10.3390/electronics15051107 - 7 Mar 2026
Abstract
Aiming at the complex hybrid scenario where Low Earth Orbit (LEO) satellite communication systems simultaneously serve unmanned terminals and terrestrial mobile users, this study proposes a two-stage resource allocation strategy based on the Deep Deterministic Policy Gradient (DDPG) algorithm. The strategy is designed to tackle the problems of uneven traffic distribution and large discrepancies in users’ real-time requirements. First, load balancing is achieved by flexibly adjusting the mapping relationship between users and satellite beams. Then, the Time-Frequency Deep Deterministic Policy Gradient (TF-DDPG) deep reinforcement learning algorithm is adopted, through which the agent autonomously learns via training and dynamically allocates time-frequency resources within a short period, giving priority to guaranteeing the communication demands of unmanned terminals. Simulation results demonstrate that, compared with heuristic algorithms, the proposed strategy realizes millisecond-level response in resource allocation decisions and improves system resource utilization, with an average user satisfaction rate of 73.41%. This method effectively resolves the issue of satellite time-frequency resource allocation in complex hybrid scenarios and provides a practical solution for the efficient resource management of future LEO satellite internet systems. Full article
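
The first stage (re-mapping users to beams for load balance) can be illustrated independently of the TF-DDPG learner. The greedy re-assignment below is a generic sketch that assumes any beam covering a user is an acceptable candidate.

```python
def balance_beam_load(users, beam_capacity, candidate_beams):
    """Greedy user-to-beam mapping that keeps per-beam traffic balanced.

    users           : list of (user_id, demand) tuples, heaviest assigned first
    beam_capacity   : dict beam_id -> capacity
    candidate_beams : dict user_id -> list of beams whose footprint covers the user
    """
    load = {b: 0.0 for b in beam_capacity}
    mapping = {}
    for user_id, demand in sorted(users, key=lambda u: -u[1]):
        # Pick the covering beam with the lowest relative load after assignment.
        beam = min(candidate_beams[user_id],
                   key=lambda b: (load[b] + demand) / beam_capacity[b])
        mapping[user_id] = beam
        load[beam] += demand
    return mapping
```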
