Search Results (113)

Search Parameters:
Keywords = partially observed Markov decision process

21 pages, 4738 KiB  
Article
Research on Computation Offloading and Resource Allocation Strategy Based on MADDPG for Integrated Space–Air–Marine Network
by Haixiang Gao
Entropy 2025, 27(8), 803; https://doi.org/10.3390/e27080803 - 28 Jul 2025
Viewed by 231
Abstract
This paper investigates the problem of computation offloading and resource allocation in an integrated space–air–sea network based on unmanned aerial vehicles (UAVs) and low Earth orbit (LEO) satellites supporting Maritime Internet of Things (M-IoT) devices. In the complex, dynamic environment comprising M-IoT devices, UAVs, and LEO satellites, traditional optimization methods encounter significant limitations due to non-convexity and the combinatorial explosion of possible solutions. A multi-agent deep deterministic policy gradient (MADDPG)-based optimization algorithm is proposed to address these challenges. The algorithm minimizes total system cost, balancing energy consumption and latency through partial task offloading within a cloud–edge–device collaborative mobile edge computing (MEC) system. A comprehensive system model is proposed, with the problem formulated as a partially observable Markov decision process (POMDP) that integrates association control, power control, computing resource allocation, and task distribution. Each M-IoT device and UAV acts as an intelligent agent, collaboratively learning optimal offloading strategies through the centralized training and decentralized execution framework inherent in MADDPG. Numerical simulations validate the effectiveness of the proposed approach, which converges rapidly and significantly outperforms baseline methods, reducing the total system cost by 15–60%.
(This article belongs to the Special Issue Space-Air-Ground-Sea Integrated Communication Networks)
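
Below is a minimal sketch of the MADDPG pattern this abstract describes: each agent keeps its own actor over local observations, while a centralized critic scores the joint observation–action vector during training (the CTDE scheme). All sizes, layer widths, and names are illustrative assumptions, not the paper's implementation.

```python
# Minimal CTDE sketch: per-agent actors, one centralized critic.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):          # local observation -> continuous action
        return self.net(obs)

class CentralCritic(nn.Module):
    def __init__(self, n_agents, obs_dim, act_dim):
        super().__init__()
        joint = n_agents * (obs_dim + act_dim)   # sees everything in training
        self.net = nn.Sequential(nn.Linear(joint, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, all_obs, all_acts):        # joint state-action value
        return self.net(torch.cat([all_obs, all_acts], dim=-1))

n_agents, obs_dim, act_dim = 3, 8, 2             # illustrative sizes
actors = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
critic = CentralCritic(n_agents, obs_dim, act_dim)

obs = torch.randn(n_agents, obs_dim)             # one observation per agent
acts = torch.stack([a(o) for a, o in zip(actors, obs)])
q = critic(obs.flatten(), acts.flatten())        # centralized value estimate
print(q.item())
```

At execution time only the actors run, each on its own local observation; the critic exists purely to shape gradients during training.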

33 pages, 4841 KiB  
Article
Research on Task Allocation in Four-Way Shuttle Storage and Retrieval Systems Based on Deep Reinforcement Learning
by Zhongwei Zhang, Jingrui Wang, Jie Jin, Zhaoyun Wu, Lihui Wu, Tao Peng and Peng Li
Sustainability 2025, 17(15), 6772; https://doi.org/10.3390/su17156772 - 25 Jul 2025
Viewed by 305
Abstract
The four-way shuttle storage and retrieval system (FWSS/RS) is an advanced automated warehousing solution for achieving green and intelligent logistics, and task allocation is crucial to its logistics efficiency. However, current research on task allocation in three-dimensional storage environments mostly addresses the single-operation mode, which handles inbound or outbound tasks individually, with limited attention paid to the more prevalent composite operation mode in which inbound and outbound tasks coexist. To bridge this gap, this study investigates the task allocation problem in an FWSS/RS under the composite operation mode and introduces deep reinforcement learning (DRL) to solve it. First, the FWSS/RS operational workflows and equipment motion characteristics are analyzed, and a task allocation model with total task completion time as the optimization objective is established. The task allocation problem is then transformed into a partially observable Markov decision process amenable to reinforcement learning. Each shuttle is regarded as an independent agent that receives localized observations, including its position and task completion status, and a deep neural network fits its value function to select actions (see the sketch below). All agents are trained within an independent deep Q-network (IDQN) framework that facilitates collaborative learning through experience sharing while maintaining decentralized decision-making based on individual observations. To validate the efficiency and effectiveness of the proposed model and method, experiments were conducted across various problem scales and transport resource configurations. The experimental results demonstrate that the DRL-based approach outperforms conventional task allocation methods, including the auction algorithm and the genetic algorithm: across multiple scenarios, the proposed IDQN-based method reduces task completion time by up to 12.88% compared to the auction algorithm and by up to 8.64% compared to the genetic algorithm. Task-related factors are also found to have a more significant impact on the optimization objectives than transport-resource-related factors.
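
A toy sketch of the IDQN setup just described, under assumptions of our own (the observation encoding, sizes, and the absence of a separate target network): each shuttle owns an independent Q-network, while a shared replay buffer supplies the experience sharing the abstract mentions.

```python
# IDQN sketch: independent Q-networks per shuttle, pooled experience.
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, n_agents = 6, 4, 3      # illustrative sizes
q_nets = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_actions)) for _ in range(n_agents)]
shared_replay = deque(maxlen=10_000)        # all agents push transitions here

def act(agent_id, obs, eps=0.1):
    """Epsilon-greedy action from the agent's own Q-network (obs: 1-D tensor)."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_nets[agent_id](obs).argmax())

def td_loss(agent_id, batch, gamma=0.99):
    """One-step TD loss on a batch (obs, action, reward, next_obs) of tensors."""
    obs, a, r, next_obs = batch             # a must be a LongTensor of indices
    q = q_nets[agent_id](obs).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                   # bootstrap from the same network
        target = r + gamma * q_nets[agent_id](next_obs).max(1).values
    return nn.functional.mse_loss(q, target)
```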

31 pages, 1576 KiB  
Article
Joint Caching and Computation in UAV-Assisted Vehicle Networks via Multi-Agent Deep Reinforcement Learning
by Yuhua Wu, Yuchao Huang, Ziyou Wang and Changming Xu
Drones 2025, 9(7), 456; https://doi.org/10.3390/drones9070456 - 24 Jun 2025
Viewed by 516
Abstract
Intelligent Connected Vehicles (ICVs) impose stringent requirements on real-time computational services. However, limited onboard resources and the high latency of remote cloud servers restrict traditional solutions. Unmanned Aerial Vehicle (UAV)-assisted Mobile Edge Computing (MEC), which deploys computing and storage resources at the network edge, offers a promising alternative. In UAV-assisted vehicular networks, jointly optimizing content and service caching, computation offloading, and UAV trajectories to maximize system performance is a critical challenge: it requires balancing system energy consumption and resource allocation fairness while maximizing cache hit rate and minimizing task latency. To this end, we introduce system efficiency as a unified metric that comprehensively considers cache hit rate, task computation latency, system energy consumption, and resource allocation fairness, and we aim to maximize it through joint optimization. The problem involves discrete decisions (caching, offloading) and continuous variables (UAV trajectories) and is highly dynamic and non-convex, making it challenging for traditional optimization methods; existing multi-agent deep reinforcement learning (MADRL) methods, meanwhile, often suffer training instability and convergence issues in such dynamic, non-stationary environments. To address these challenges, this paper proposes a MADRL-based joint optimization approach. We model the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and adopt the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm, which follows the Centralized Training Decentralized Execution (CTDE) paradigm. Simulation results demonstrate that, compared to various representative baseline methods, the proposed MAPPO algorithm achieves higher cumulative rewards and an approximately 82% cache hit rate.
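
The "system efficiency" metric is a scalarization of several objectives. One plausible form, using Jain's index for the fairness term (the weights and normalization are our assumptions; the paper defines its own expression):

```python
# Sketch of a unified system-efficiency score: higher is better.
def jain_fairness(allocs):
    """Jain's fairness index in (0, 1]; 1 means a perfectly even allocation."""
    s, sq = sum(allocs), sum(x * x for x in allocs)
    return (s * s) / (len(allocs) * sq) if sq else 1.0

def system_efficiency(hit_rate, delay, energy, allocs,
                      w=(0.4, 0.2, 0.2, 0.2), delay_max=1.0, energy_max=1.0):
    # Reward cache hits and fairness; penalize normalized delay and energy.
    return (w[0] * hit_rate
            - w[1] * delay / delay_max
            - w[2] * energy / energy_max
            + w[3] * jain_fairness(allocs))

print(system_efficiency(0.82, 0.3, 0.5, [1.0, 0.9, 1.1]))  # ~0.367
```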

20 pages, 690 KiB  
Article
Using Graph-Enhanced Deep Reinforcement Learning for Distribution Network Fault Recovery
by Yueran Liu, Peng Liao and Yang Wang
Machines 2025, 13(7), 543; https://doi.org/10.3390/machines13070543 - 23 Jun 2025
Viewed by 421
Abstract
Fault recovery in distribution networks is a complex, high-dimensional decision-making task characterized by partial observability, dynamic topology, and strong interdependencies among components. To address these challenges, this paper proposes a graph-based multi-agent deep reinforcement learning (DRL) framework for intelligent fault restoration in power distribution networks. The restoration problem is modeled as a partially observable Markov decision process (POMDP), where each agent employs graph neural networks to extract topological features and enhance environmental perception. To address the high dimensionality of the action space, an action decomposition strategy is introduced that treats each switch operation as an independent binary classification task, which improves convergence and decision efficiency. Furthermore, a collaborative reward mechanism is designed to promote coordination among agents and optimize global restoration performance. Experiments on the PG&E 69-bus system demonstrate that the proposed method significantly outperforms existing DRL baselines: it achieves up to 2.6% higher load recovery, up to 0.0 p.u. lower recovery cost, and full restoration in the midday scenario, with statistically significant improvements (p<0.05 or p<0.01). These results highlight the effectiveness of graph-based learning and cooperative rewards in improving the resilience, efficiency, and adaptability of distribution network operations under varying conditions.
(This article belongs to the Section Machines Testing and Maintenance)
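
The action-decomposition strategy can be pictured as follows: rather than one head over all 2^N switch configurations, the network emits one open/close logit per switch, so the action space grows linearly. A plain MLP stands in here for the paper's graph-neural-network encoder; all dimensions are illustrative.

```python
# Per-switch binary heads instead of a joint 2^N action space.
import torch
import torch.nn as nn

class SwitchPolicy(nn.Module):
    def __init__(self, feat_dim, n_switches):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        # One binary head per switch: N logits, not 2^N joint actions.
        self.heads = nn.Linear(64, n_switches)

    def forward(self, features):
        h = self.encoder(features)
        return torch.sigmoid(self.heads(h))   # P(close) for each switch

policy = SwitchPolicy(feat_dim=16, n_switches=10)
p_close = policy(torch.randn(1, 16))
actions = (p_close > 0.5).int()               # independent binary decisions
print(actions)
```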

24 pages, 19686 KiB  
Article
Enhancing Geomagnetic Navigation with PPO-LSTM: Robust Navigation Utilizing Observed Geomagnetic Field Data
by Xiaohui Zhang, Wenqi Bai, Jun Liu, Songnan Yang, Ting Shang and Haolin Liu
Sensors 2025, 25(12), 3699; https://doi.org/10.3390/s25123699 - 13 Jun 2025
Viewed by 465
Abstract
Geospatial navigation in GPS-denied environments presents significant challenges, particularly for autonomous vehicles operating in complex, unmapped regions. We explore the Earth’s geomagnetic field, a globally distributed and naturally occurring resource, as a reliable alternative for navigation. Since vehicles can only observe the geomagnetic field along their traversed paths, they must infer the navigation strategy from incomplete information; we therefore formulate the navigation problem as a partially observed Markov decision process (POMDP). To address this POMDP, we employ proximal policy optimization with long short-term memory (PPO-LSTM), a deep reinforcement learning framework that captures temporal dependencies and mitigates the effects of noise. Using real-world geomagnetic data from the International Geomagnetic Reference Field (IGRF) model, we validate our approach through experiments under noisy conditions. The results demonstrate that PPO-LSTM outperforms baseline algorithms, achieving smoother trajectories and higher heading accuracy. The framework effectively handles the uncertainty and partial observability inherent in geomagnetic navigation, enabling policies that adapt to complex field gradients and offering a reliable solution for geospatial navigation.
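
A minimal recurrent policy of the PPO-LSTM kind described above (the dimensions and the discrete heading head are our assumptions): the LSTM hidden state summarizes the field readings seen so far, which is how the partial observability is handled.

```python
# Recurrent policy: the LSTM state acts as a learned belief over position.
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=3, hidden=64, n_headings=8):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, n_headings)   # heading-choice logits

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)    # carry memory across steps
        return self.pi(out[:, -1]), state

policy = RecurrentPolicy()
obs_seq = torch.randn(1, 20, 3)    # 20 steps of 3-axis field measurements
logits, state = policy(obs_seq)
action = torch.distributions.Categorical(logits=logits).sample()
```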

30 pages, 16390 KiB  
Article
Model-Based RL Decision-Making for UAVs Operating in GNSS-Denied, Degraded Visibility Conditions with Limited Sensor Capabilities
by Sebastien Boiteau, Fernando Vanegas, Julian Galvez-Serna and Felipe Gonzalez
Drones 2025, 9(6), 410; https://doi.org/10.3390/drones9060410 - 4 Jun 2025
Viewed by 1637
Abstract
Autonomy in Unmanned Aerial Vehicle (UAV) navigation has enabled applications in diverse fields such as mining, precision agriculture, and planetary exploration. However, challenging applications in complex environments complicate the interaction between the agent and its surroundings. Conditions such as the absence of a Global Navigation Satellite System (GNSS), low visibility, and cluttered environments significantly increase uncertainty and cause partial observability, and these challenges grow when compact, low-cost, entry-level sensors are employed. This study proposes a model-based reinforcement learning (RL) approach that enables UAVs to navigate and make decisions autonomously in environments where GNSS is unavailable and visibility is limited. Designed for search and rescue operations, the system enables UAVs to navigate cluttered indoor environments, detect targets, and avoid obstacles under low-visibility conditions. The architecture integrates onboard sensors, including a thermal camera to detect a collapsed person (the target), plus a 2D LiDAR and an IMU for localization. The decision-making module employs the Adaptive Belief Tree (ABT) solver for real-time policy computation. Because the framework relies on low-cost, entry-level sensors, it is suitable for lightweight UAV platforms. Experimental results demonstrate high success rates in target detection and robust performance in obstacle avoidance and navigation despite uncertainties in pose estimation and detection. The framework was first assessed in simulation against a baseline algorithm and then through real-life testing across several scenarios. The proposed system represents a step forward in UAV autonomy for critical applications, with potential extensions to unknown and fully stochastic environments.

20 pages, 1778 KiB  
Article
Energy Management for Distributed Carbon-Neutral Data Centers
by Wenting Chang, Chuyi Liu, Guanyu Ren and Jianxiong Wan
Energies 2025, 18(11), 2861; https://doi.org/10.3390/en18112861 - 30 May 2025
Cited by 1 | Viewed by 340
Abstract
With the continuous expansion of data centers, their carbon emissions have become a serious issue, and a number of studies have sought to reduce them. Carbon trading, carbon capture, and power-to-gas technologies are promising emission-reduction techniques that are, however, seldom applied to data centers. To bridge this gap, we propose a carbon-neutral architecture for distributed data centers in which each data center consists of three subsystems: an energy subsystem for energy supply, a thermal subsystem for cooling, and a carbon subsystem for carbon trading. We then formulate the energy management problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and develop a distributed solution framework using the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. Finally, simulations using real-world data show that the framework provides a cost saving of 20.3%.
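
For reference, the standard Dec-POMDP tuple that such a formulation instantiates (the concrete agent set and the energy/thermal/carbon state variables are the paper's own and are not spelled out here):

```latex
\[
\mathcal{M} = \langle \mathcal{I},\, \mathcal{S},\, \{\mathcal{A}_i\},\,
P,\, R,\, \{\Omega_i\},\, O,\, \gamma \rangle,
\]
% Agent $i \in \mathcal{I}$ receives observation $o_i \in \Omega_i$ drawn from
% $O(o_1,\dots,o_n \mid s', a)$ after the joint action $a = (a_1,\dots,a_n)$
% moves the state via $P(s' \mid s, a)$; all agents share the team reward
% $R(s, a)$, discounted by $\gamma$.
```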

27 pages, 5560 KiB  
Article
A Stackelberg Trust-Based Human–Robot Collaboration Framework for Warehouse Picking
by Yang Liu, Fuqiang Guo and Yan Ma
Systems 2025, 13(5), 348; https://doi.org/10.3390/systems13050348 - 3 May 2025
Viewed by 567
Abstract
The warehouse picking process is one of the most critical components of logistics operations. Human–robot collaboration (HRC) is seen as an important trend in warehouse picking, as it combines the strengths of both humans and robots. However, current HRC frameworks lack effective communication between humans and robots, which results in inefficient task execution during picking. To address this, this paper treats trust as a communication bridge between humans and robots and proposes a Stackelberg trust-based human–robot collaboration framework for warehouse picking, aiming to achieve efficient and effective human–robot collaborative picking. In this framework, trust-aware HRC for warehouse picking is modeled as a Partially Observable Stochastic Game (POSG). We model human fatigue with the logistic function and incorporate its impact on the efficiency reward function of the POSG. Within the POSG, a belief space is used to assess human trust and form human strategies. An iterative Stackelberg trust strategy generation (ISTSG) algorithm, solved via the Bellman equation, is designed to achieve optimal long-term collaboration benefits between humans and robots. The generated human–robot decision profile is formalized as a Partially Observable Markov Decision Process (POMDP), and the properties of human–robot collaboration, such as efficiency, accuracy, trust, and human fatigue, are specified as PCTL (probabilistic computation tree logic) with rewards. The probabilistic model checker PRISM is used to verify and analyze the corresponding properties of the POMDP. We take the popular picking robot TORU as a case study. The experimental results show that our framework improves the efficiency of human–robot collaboration for warehouse picking and reduces worker fatigue while ensuring the required accuracy.
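
The belief-space trust assessment rests on the generic POMDP belief update: after taking action $a$ and observing $o$, the belief over the hidden state (here, the human's trust level) is propagated through the transition model and re-weighted by the observation likelihood:

```latex
\[
b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)}
{\sum_{\tilde{s}'} O(o \mid \tilde{s}', a) \sum_{s} P(\tilde{s}' \mid s, a)\, b(s)}
\]
% b is the current belief, P the transition model, O the observation model;
% the denominator normalizes b' to a probability distribution.
```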

21 pages, 9553 KiB  
Article
Assisted-Value Factorization with Latent Interaction in Cooperate Multi-Agent Reinforcement Learning
by Zhitong Zhao, Ya Zhang, Siying Wang, Yang Zhou, Ruoning Zhang and Wenyu Chen
Mathematics 2025, 13(9), 1429; https://doi.org/10.3390/math13091429 - 27 Apr 2025
Viewed by 495
Abstract
With the development of value decomposition methods, multi-agent reinforcement learning (MARL) has made significant progress in balancing autonomous decision making with collective cooperation. However, the collaborative dynamics among agents change continuously, and current value decomposition methods struggle to handle these dynamic changes, impairing the effectiveness of cooperative policies. In this paper, we introduce the concept of latent interaction and develop an innovative weight-generation method upon it. The proposed method derives weights from historical information, enhancing the accuracy of value estimations. Building upon this, we further propose a dynamic masking mechanism that recalibrates historical information in response to the activity level of agents, improving the precision of latent interaction assessments. Experimental results demonstrate the improved training speed and superior performance of the proposed method in both a multi-agent particle environment and the StarCraft Multi-Agent Challenge.
(This article belongs to the Section E1: Mathematics and Computer Science)
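
One way to picture value mixing with history-derived weights, loosely in the spirit of this method (the architecture below is our illustration, not the paper's network): a hypernetwork maps a history embedding to non-negative per-agent weights that combine individual Q-values into a joint value.

```python
# History-conditioned value mixing: weights come from a history embedding.
import torch
import torch.nn as nn

class HistoryWeightedMixer(nn.Module):
    def __init__(self, n_agents, hist_dim):
        super().__init__()
        self.hyper_w = nn.Linear(hist_dim, n_agents)   # weights from history

    def forward(self, agent_qs, hist_embed):
        # abs() keeps weights non-negative so Q_tot is monotone in each Q_i,
        # preserving consistency between joint and individual greedy actions.
        w = torch.abs(self.hyper_w(hist_embed))
        return (w * agent_qs).sum(dim=-1, keepdim=True)

mixer = HistoryWeightedMixer(n_agents=4, hist_dim=32)
q_tot = mixer(torch.randn(1, 4), torch.randn(1, 32))   # joint value estimate
```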

30 pages, 3310 KiB  
Article
Enhancing Scalability and Network Efficiency in IOTA Tangle Networks: A POMDP-Based Tip Selection Algorithm
by Mays Alshaikhli, Somaya Al-Maadeed and Moutaz Saleh
Computers 2025, 14(4), 117; https://doi.org/10.3390/computers14040117 - 24 Mar 2025
Cited by 1 | Viewed by 947
Abstract
The fairness problem in the IOTA (Internet of Things Application) Tangle network has significant implications for transaction efficiency, scalability, and security, particularly concerning orphan transactions and lazy tips. Traditional tip selection algorithms (TSAs) struggle to ensure fair tip selection, leading to inefficient transaction confirmations and network congestion. This research proposes a novel partially observable Markov decision process (POMDP)-based TSA that dynamically prioritizes tips with lower confirmation likelihood, reducing orphan transactions and enhancing network throughput. By leveraging probabilistic decision making and Monte Carlo tree search, the proposed TSA selects tips based on long-term impact rather than immediate transaction weight. The algorithm is rigorously evaluated against seven existing TSAs, including Random Walk, Unweighted TSA, Weighted TSA, Hybrid TSA-1, Hybrid TSA-2, E-IOTA, and G-IOTA, under various network conditions. The experimental results demonstrate that the POMDP-based TSA achieves a confirmation rate of 89–94%, reduces the orphan tip rate to 1–5%, and completely eliminates lazy tips (0%). Additionally, the proposed method ensures stable scalability and high security resilience, making it a robust and efficient solution for decentralized ledger networks. These findings highlight the potential of reinforcement learning-driven TSAs to enhance fairness, efficiency, and robustness in DAG-based blockchain systems and pave the way for future research into adaptive and scalable consensus mechanisms for the IOTA Tangle.
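
A toy rendering of the selection bias the abstract credits for eliminating lazy tips (the DAG model and the scoring are simplified stand-ins for the paper's POMDP/MCTS formulation):

```python
# Fairness-biased tip selection: neglected tips get proportionally more weight.
import random

def select_tips(tips, confirm_prob, k=2, bias=2.0):
    """Pick k tips, favoring those the model deems unlikely to be confirmed.

    confirm_prob[t] stands in for the POMDP's estimate of tip t's confirmation
    likelihood; raising (1 - p) to a bias power boosts neglected tips. Note
    random.choices samples with replacement, fine for a toy illustration.
    """
    weights = [(1 - confirm_prob[t]) ** bias for t in tips]
    return random.choices(tips, weights=weights, k=k)

tips = ["t1", "t2", "t3"]
p = {"t1": 0.9, "t2": 0.2, "t3": 0.5}    # t2 is the most neglected tip
print(select_tips(tips, p))              # usually includes t2
```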

17 pages, 3530 KiB  
Article
A Novel Model for Optimizing Roundabout Merging Decisions Based on Markov Decision Process and Force-Based Reward Function
by Qingyuan Shen, Haobin Jiang, Aoxue Li and Shidian Ma
Mathematics 2025, 13(6), 912; https://doi.org/10.3390/math13060912 - 9 Mar 2025
Viewed by 991
Abstract
Autonomous vehicles (AVs) increasingly operate in complex traffic environments where safe and efficient decision-making is crucial, and merging into roundabouts is a key interaction scenario. This paper introduces a decision-making approach for roundabout merging that combines human driving behavior with reinforcement learning (RL) to enhance both safety and efficiency. The proposed framework models the decision-making process of AVs at roundabouts as a Markov decision process (MDP), optimizing the state, action, and reward spaces to more accurately reflect real-world driving behavior. It simplifies the state space using relative distance and speed, and defines three action profiles based on real traffic data to replicate human-like driving. A force-based reward function, derived from constitutive relations, simulates vehicle-roundabout interactions, offering detailed, physically consistent feedback that enhances learning outcomes. The results show that this method effectively replicates human-like driving decisions, supporting the integration of AVs into dynamic traffic environments. Future research should address the challenges related to partial observability and further refine the state, action, and reward spaces. This work lays the groundwork for adaptive and interpretable decision-making frameworks for AVs, contributing to safer and more efficient traffic dynamics at roundabouts.
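
A hedged illustration of what a force-based, constitutive-relation reward might look like: the gap to the conflicting roundabout vehicle is treated like a spring-damper element, so the penalty grows smoothly as the merging AV compresses the gap. The constants and the exact form are our assumptions, not the paper's.

```python
# Spring-damper style merging reward: interaction "force" becomes a penalty.
def merge_reward(gap, rel_speed, safe_gap=15.0, k=1.0, c=0.3, bonus=1.0):
    """Reward for one merging step; gap in metres, rel_speed in m/s (closing > 0)."""
    if gap >= safe_gap:
        return bonus                      # unimpeded merge progress
    compression = safe_gap - gap          # how far into the safety margin
    force = k * compression + c * max(rel_speed, 0.0)   # spring + damper terms
    return bonus - force                  # stronger interaction, lower reward

print(merge_reward(gap=8.0, rel_speed=2.0))   # -6.6: deep, fast gap intrusion
```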

22 pages, 2176 KiB  
Article
Deep Reinforcement Learning-Based Multi-Agent System with Advanced Actor–Critic Framework for Complex Environment
by Zihao Cui, Kailian Deng, Hongtao Zhang, Zhongyi Zha and Sayed Jobaer
Mathematics 2025, 13(5), 754; https://doi.org/10.3390/math13050754 - 25 Feb 2025
Cited by 1 | Viewed by 1853
Abstract
The development of artificial intelligence (AI) game agents that use deep reinforcement learning (DRL) algorithms to process visual information for decision-making has emerged as a key research focus in both academia and industry. However, previous game agents have struggled to execute multiple commands simultaneously in a single decision, failing to replicate the complex control patterns that characterize human gameplay. In this paper, we use the ViZDoom environment as the DRL research platform and model the agent–environment interaction as a Partially Observable Markov Decision Process (POMDP). We introduce an advanced multi-agent DRL framework, Multi-Agent Proximal Policy Optimization (MA-PPO), designed to optimize target acquisition while operating within defined ammunition and time constraints. In MA-PPO, each agent handles distinct parallel tasks with custom reward functions for performance evaluation, and agents make independent decisions while simultaneously executing multiple commands to mimic human-like gameplay behavior. Our evaluation compares MA-PPO against other DRL algorithms and shows a 30.67% performance improvement over the baseline algorithm.
(This article belongs to the Special Issue Application of Machine Learning and Data Mining, 2nd Edition)
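
The "multiple commands per decision" idea can be sketched as independent Bernoulli action heads, so several commands can fire in the same step, as in human play. The command list, sizes, and head structure below are illustrative assumptions.

```python
# Independent per-command Bernoulli heads instead of one exclusive softmax.
import torch
import torch.nn as nn

COMMANDS = ["forward", "strafe_left", "strafe_right", "turn", "attack"]

class MultiCommandHead(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.logits = nn.Linear(feat_dim, len(COMMANDS))

    def forward(self, features):
        dist = torch.distributions.Bernoulli(logits=self.logits(features))
        acts = dist.sample()                      # e.g. forward + attack together
        return acts, dist.log_prob(acts).sum(-1)  # joint log-prob for a PPO loss

head = MultiCommandHead()
acts, logp = head(torch.randn(1, 128))            # one multi-command decision
```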

19 pages, 21354 KiB  
Article
Asymmetric Deep Reinforcement Learning-Based Spacecraft Approaching Maneuver Under Unknown Disturbance
by Shibo Shao, Dong Zhou, Guanghui Sun, Weizhao Ma and Runran Deng
Aerospace 2025, 12(3), 170; https://doi.org/10.3390/aerospace12030170 - 20 Feb 2025
Viewed by 882
Abstract
Spacecraft approaching maneuver control normally relies on traditional methods such as Proportional–Integral–Derivative (PID) control or Model Predictive Control (MPC), which require meticulous system design and lack robustness against unknown disturbances. To address these limitations, we propose an end-to-end asymmetric Deep Reinforcement Learning-based spacecraft approaching maneuver (ADSAM) algorithm, which significantly enhances the robustness of the approaching maneuver under large-scale unknown disturbances and partial observability, modeled as a Partially Observable Markov Decision Process (POMDP). We present a numerical simulation environment based on the linear Clohessy–Wiltshire (CW) model, using the fourth-order Runge–Kutta method (RK4) to ensure accurate and efficient state transitions. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art methods.
(This article belongs to the Special Issue Space Navigation and Control Technologies)
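
Since the abstract names both ingredients, here is a compact sketch of that simulation core: linear Clohessy–Wiltshire relative dynamics propagated with classic RK4. The mean motion, control interface, and initial state are illustrative values, not the paper's configuration.

```python
# Clohessy-Wiltshire relative dynamics + RK4 propagation.
import numpy as np

def cw_deriv(s, u, n):
    """State s = [x, y, z, vx, vy, vz] in the target's LVLH frame;
    u is the control acceleration (the DRL action); n is the mean motion."""
    x, y, z, vx, vy, vz = s
    ax = 3 * n**2 * x + 2 * n * vy + u[0]    # radial
    ay = -2 * n * vx + u[1]                  # along-track
    az = -(n**2) * z + u[2]                  # cross-track
    return np.array([vx, vy, vz, ax, ay, az])

def rk4_step(s, u, n, dt):
    k1 = cw_deriv(s, u, n)
    k2 = cw_deriv(s + 0.5 * dt * k1, u, n)
    k3 = cw_deriv(s + 0.5 * dt * k2, u, n)
    k4 = cw_deriv(s + dt * k3, u, n)
    return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

n = 0.0011                                        # rad/s, ~95-minute LEO orbit
state = np.array([100.0, 0.0, 0.0, 0.0, -0.2, 0.0])  # chaser 100 m ahead
state = rk4_step(state, u=np.zeros(3), n=n, dt=1.0)  # one second of free drift
```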

34 pages, 442 KiB  
Review
A Review of Multi-Agent Reinforcement Learning Algorithms
by Jiaxin Liang, Haotian Miao, Kai Li, Jianheng Tan, Xi Wang, Rui Luo and Yueqiu Jiang
Electronics 2025, 14(4), 820; https://doi.org/10.3390/electronics14040820 - 19 Feb 2025
Cited by 4 | Viewed by 9476
Abstract
In recent years, multi-agent reinforcement learning algorithms have demonstrated immense potential in various fields, such as robotic collaboration and game AI. This paper introduces the modeling concepts of single-agent and multi-agent systems: the fundamental principles of Markov Decision Processes and Markov Games. Reinforcement learning algorithms are divided into three categories, value-based, policy-based, and actor–critic, and representative algorithms and applications in each category are introduced. Based on differences in reward functions, multi-agent reinforcement learning algorithms are further classified into three categories: fully cooperative, fully competitive, and mixed types. The paper systematically reviews their basic principles, applications in multi-agent systems, challenges faced, and corresponding solutions. Specifically, it discusses the challenges faced by multi-agent reinforcement learning algorithms from four aspects: dimensionality, non-stationarity, partial observability, and scalability. Additionally, it surveys existing algorithm-training environments for multi-agent systems and summarizes the applications of multi-agent reinforcement learning algorithms across different domains. Through this discussion, readers can gain a comprehensive understanding of the current research status and future trends in multi-agent reinforcement learning, providing valuable insights for further exploration and application in this field.
(This article belongs to the Topic Agents and Multi-Agent Systems)

18 pages, 947 KiB  
Article
Joint Optimal Policy for Maintenance, Spare Unit Selection and Inventory Control Under a Partially Observable Markov Decision Process
by Nozomu Ogura, Mizuki Kasuya and Lu Jin
Mathematics 2025, 13(3), 406; https://doi.org/10.3390/math13030406 - 26 Jan 2025
Cited by 1 | Viewed by 746
Abstract
This research investigates the joint optimization of maintenance and spare unit management for series systems composed of multiple heterogeneous units. With advancements in communication and sensing technologies, condition-based maintenance has gained attention as an integral aspect of spare unit management, and the inherent interaction between maintenance activities and spare unit management makes their simultaneous optimization necessary to enhance overall system performance. Based on uncertain information about the system’s deterioration state and spare unit inventory, decision-makers determine actions related to spare units and maintenance, such as replacements, the selection of spare units for corresponding units, order quantities, and inventory levels. Within the framework of a partially observable Markov decision process, this research proposes an optimal joint policy for maintenance and spare unit management that minimizes total expected costs. The proposed policy is demonstrated on a three-state, two-unit series system. Sensitivity analyses and comparisons with benchmark policies are also conducted to evaluate the performance of the proposed policy and to investigate the impact of various cost parameters on the proposed decisions.
(This article belongs to the Special Issue Mathematics in Advanced Reliability and Maintenance Modeling)
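
A tiny numeric illustration of the belief tracking such a policy rests on, for a hypothetical three-state (good/degraded/failed) unit; all matrices and numbers below are made up for the example.

```python
# Belief update over hidden deterioration states from a noisy condition signal.
import numpy as np

P = np.array([[0.90, 0.08, 0.02],     # state transitions per period
              [0.00, 0.85, 0.15],     # rows/cols: good, degraded, failed
              [0.00, 0.00, 1.00]])
O = np.array([[0.80, 0.15, 0.05],     # P(signal | true state); row = state,
              [0.20, 0.60, 0.20],     # col = observed signal level
              [0.05, 0.25, 0.70]])

belief = np.array([1.0, 0.0, 0.0])    # start from a known-good unit
signal = 1                            # observed a "degraded-looking" reading

predicted = belief @ P                # propagate deterioration one period
posterior = predicted * O[:, signal]  # re-weight by observation likelihood
posterior /= posterior.sum()
print(posterior.round(3))             # [0.718 0.255 0.027]: degraded risk rose
```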
