Article

Fault Reconfiguration of Shipboard MVDC Power Systems Based on Multi-Agent Reinforcement Learning

1
Department of Electrical Engineering, Shanghai Maritime University, 1550 Haigang Avenue, Shanghai 201306, China
2
Institut de Recherche en Energie Electrique de Nantes Atlantique, IREENA, Nantes University, UR 4642, 37 Bd de l’Université, 44600 Saint-Nazaire, France
3
Institut de Recherche Dupuy de Lôme (UMR CNRS 6027), University of Brest, 29238 Brest, France
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(3), 278; https://doi.org/10.3390/jmse14030278
Submission received: 2 January 2026 / Revised: 22 January 2026 / Accepted: 27 January 2026 / Published: 29 January 2026
(This article belongs to the Section Ocean Engineering)

Abstract

In the event of a fault in a shipboard medium-voltage direct-current (MVDC) power system, a fault reconfiguration method issues control commands to the switchgear to execute switching actions, thereby redistributing power flow, isolating the faulted zone, and restoring power to the de-energized loads. Existing fault reconfiguration strategies mainly use classical optimization methods. These methods are usually centralized and, as the system scale increases, suffer from the curse of dimensionality, which degrades real-time performance and reduces computational efficiency. This paper proposes a fault reconfiguration method for shipboard MVDC power systems based on multi-agent deep reinforcement learning (MADRL). The proposed method considers load priority levels, maximizes total restored load, and improves load balancing. To this end, a QMIX-based method, Dependency-Corrected QMIX with Action Masking (Dep-QMIX-Mask), was developed, introducing a dependency correction mechanism to handle action dependencies during decentralized execution and applying action masking to rule out invalid switching actions under operational constraints. A shipboard MVDC power system model was established and used for validation. Across three representative fault cases, Dep-QMIX-Mask achieves served load rates (SLRs) of 0.88, 0.67, and 0.43, with SLR improvements of up to 19.6% over baseline methods. It consistently produces feasible switching sequences in all 20 independent runs per case, improving feasibility by 10 to 30 percentage points. In addition, Dep-QMIX-Mask improves zonal load balancing, reducing the variance of the power utilization rate (PUR) by 40.5% to 99.2% compared with baseline methods. These results indicate that Dep-QMIX-Mask can generate feasible sequential reconfiguration strategies while improving both load restoration and load balancing.

1. Introduction

As global energy shortages grow, climate change worsens, and environmental pollution persists, the shipping industry faces increasing pressure to achieve sustainable development through a green and low-carbon transition. However, traditional shipboard power systems have become a bottleneck due to low energy efficiency and high emissions. Shipboard MVDC power systems, which replace traditional AC distribution with medium-voltage DC distribution, offer high power density, high efficiency, flexibility, and reliability [1]. Although MVDC systems outperform conventional AC systems, their compact and highly coupled power-electronic configurations, together with short lines and low impedance, allow fault currents to propagate rapidly. In severe cases, this can interrupt critical shipboard loads and compromise operational safety [2]. Therefore, fault reconfiguration is essential to isolate faulted zones, contain fault propagation, and restore power to as many shipboard loads as possible. Fault reconfiguration in a shipboard power system is performed after a fault occurs. It isolates the faulted zone and restores as much load as possible through switching operations, while meeting requirements such as power balance, voltage limits, and equipment capacity [3].
Currently, solutions to the fault reconfiguration problem in shipboard power systems can be roughly divided into three types. The first type includes mathematical optimization methods [4,5,6,7,8,9,10,11], such as integer programming and quadratic programming, which formulate the problem with explicit mathematical models to obtain globally optimal solutions. The second type comprises heuristic algorithms [12], such as branch-and-bound and branch-and-cut, which employ structured search strategies to obtain near-optimal solutions. The third type includes intelligent optimization algorithms [13,14,15,16,17,18], such as particle swarm optimization and the grey wolf optimizer, which use stochastic search to obtain near-optimal solutions to multi-objective optimization problems. However, these methods often involve high computational complexity, adapt poorly to dynamic environments, struggle with high-dimensional discrete action spaces, and may fall into local optima under complex constraints.
In recent years, advances in machine learning have made deep reinforcement learning (DRL), based on Markov decision processes (MDPs), a promising approach for complex decision-making in shipboard power system operation. DRL has been widely applied to real-time scheduling [19], energy management [20,21], voltage and reactive power control [22,23], and emergency control [24]. Although DRL typically requires considerable training, online decision-making after training can be carried out with a single forward pass through the network. Yi et al. determined the optimal switch states for a given fault during real-time dynamic network reconfiguration, while simultaneously optimizing distribution network reconfiguration and distributed generator setpoints, thereby enabling efficient system restoration [25]. Huang et al. proposed a switch control strategy based on a deep Q-network (DQN) to enhance microgrid resilience during extreme weather by forming microgrids within distribution networks [26]. Zhang et al. proposed a two-stage fault reconfiguration method based on a logical structure and enhanced reinforcement learning, which significantly improves reconfiguration efficiency and power supply stability in shipboard DC zonal power systems under fault conditions [27]. However, this approach focuses only on topology reconfiguration and does not include power flow calculations. In addition, it provides only a final switching plan without considering the switching sequence. Since reconfiguration is implemented step-by-step, intermediate states must remain feasible and safe; therefore, optimization that accounts for the switching sequence is required beyond final topology optimization. Oh et al. applied a DQN-based approach to optimize dynamic reconfiguration and service restoration in distribution networks, improving post-fault self-healing performance [28]. This approach enables faster restoration after faults and enhances system stability and reliability.
Jo et al. proposed the RDDNR method based on deep reinforcement learning. By listing candidate radial switch sets to build the action space, RDDNR enables continuous self-healing reconfiguration, thereby improving restoration efficiency and system resilience. However, it does not consider load shedding or the switching sequence [29].
Fault reconfiguration is essentially a switch configuration optimization problem. As shipboard MVDC systems scale up, the number of switches increases, and the DRL action space grows exponentially, making training with a single agent difficult. Multi-agent deep reinforcement learning (MADRL) mitigates this by decomposing the joint action space into lower-dimensional subspaces. Yao et al. applied MADRL to critical load restoration in distribution systems, where each agent independently makes decisions based on local generator and load information to optimize the restoration process [30]. However, this approach does not explicitly model agent interaction mechanisms and relies solely on local observations, making it difficult to achieve collaborative optimization toward global objectives such as cross-regional load balancing and overall coordination. Si et al. proposed a MADRL-based restoration method for distribution systems, modeling source-grid-load cooperative restoration as a partially observable stochastic game. They introduced action masks and a dynamic agent network, enhancing load restoration capability and computational efficiency during multi-step fault reconfiguration [31]. However, this research primarily targets distribution network scenarios, with optimization objectives focused on maximizing power restoration; it still lacks a comprehensive trade-off between load balancing and overall system performance.
Existing DRL/MADRL studies on fault reconfiguration remain limited in several respects. With increasing switch numbers, single-agent formulations face an exponential growth of the action space, which substantially raises computational cost and can impair convergence. MADRL reduces the dimensionality of decision-making, yet explicit modeling of inter-agent dependencies is still lacking, weakening system-level coordination. Furthermore, most formulations emphasize maximizing restored load, while load balancing and overall system performance are less thoroughly addressed. To address the aforementioned issues, this paper proposes a MADRL-based fault reconfiguration method for shipboard MVDC power systems. The method aims to maximize total restored load while improving load balancing, with load priorities taken into account. First, the fault reconfiguration problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP). Then, a Dep-QMIX-Mask algorithm is developed by integrating QMIX with dependency correction and an action masking mechanism to solve this problem. Finally, the effectiveness of the proposed method is validated through multiple case studies and comparative analyses.
The main contributions of this paper are summarized as follows:
(1)
A multi-agent Dec-POMDP formulation is presented for shipboard MVDC fault reconfiguration, and the problem is decomposed into multiple agents to reduce the joint action space and support decentralized execution based on local observations.
(2)
A stepwise constrained reconfiguration objective is established to prioritize load restoration under load priorities while promoting zonal load balancing, with operational constraints enforced during the switching sequence.
(3)
Based on QMIX, dependency correction and action masking are integrated to handle action dependencies among agents and filter infeasible or redundant actions during decentralized execution, improving the feasibility of the generated switching sequences.
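Action masking, named in contribution (3), filters infeasible switching actions before action selection. As a minimal sketch (not the paper's implementation; the Q-values and mask below are illustrative), masking can be realized by overwriting the Q-values of invalid actions with negative infinity before the greedy argmax:

```python
# Hedged sketch of action masking over one agent's Q-values.
# The Q-values and validity mask are made-up illustrative inputs.

def masked_argmax(q_values, valid_mask):
    """Pick the greedy action among valid ones; invalid entries get -inf."""
    masked = [q if ok else float("-inf") for q, ok in zip(q_values, valid_mask)]
    return max(range(len(masked)), key=lambda a: masked[a])

# Action 1 has the highest raw Q-value but is infeasible under the
# current operational constraints, so the mask excludes it.
best = masked_argmax([0.2, 0.9, 0.5], [True, False, True])
```

With the mask applied, the agent selects the best feasible action (index 2 here) instead of the constraint-violating one.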
The remainder of this paper is organized as follows. Section 2 models the fault reconfiguration problem in shipboard MVDC power systems as a Dec-POMDP. It defines the system state space, the agents’ action space, the local observation space, the reward function, the objective function, and the constraints for fault reconfiguration. Section 3 details the proposed Dep-QMIX-Mask architecture. It explains the dependency correction and action masking mechanisms, the mixing logic for multi-agent value functions, and the training procedure. Section 4 presents case studies using shipboard MVDC fault scenarios covering different fault types and severity levels. It compares the proposed method with other approaches in terms of the served load factor and load balancing performance. Section 5 summarizes the paper and outlines directions for future work.

2. Modeling Fault Reconfiguration in Shipboard MVDC Power Systems as a Dec-POMDP

2.1. Shipboard MVDC Power Systems: Structure and Directed-Graph Representation

According to IEEE standards, shipboard MVDC power systems typically adopt radial or ring-type zonal architectures [32]. Because the ring-type zonal structure offers higher power supply reliability and a better use of shipboard space, the shipboard MVDC power system studied in this paper adopts this configuration, as shown in Figure 1. The system is supplied by four diesel generators. The generated AC power is converted to medium-voltage DC by rectifiers and fed to the main DC bus, which then supplies the shipboard loads through the electrical distribution network. The primary loads in Figure 1 have dual supply paths and can be served by either the port or starboard bus. Secondary and tertiary loads have only a single supply path. When generation capacity is insufficient, tertiary and secondary loads can be shed to ensure that primary loads remain energized.
To simplify the analysis, this paper assumes that after a fault occurs in the shipboard MVDC power system shown in Figure 1, the circuit breakers adjacent to the faulted area trip promptly to isolate the faulted zone.
From a graph-theoretic perspective, busbars, generators, and loads can be modeled as nodes, and power lines as directed edges. Accordingly, the topology of a shipboard MVDC power system can be represented as a directed graph G = (V, x), where V denotes the set of nodes (system components), and x denotes the set of edges (power lines) connecting these nodes. Thus, the shipboard MVDC power system shown in Figure 1 can be represented as a directed graph, as illustrated in Figure 2. In Figure 2, light-blue nodes represent medium-voltage DC buses, green nodes represent shipboard loads, and dark-blue nodes represent generators. All belong to the node set V. Solid lines denote energized lines in normal operation, whereas dashed lines denote de-energized backup lines. All belong to the edge set x. The light-blue solid line represents the main DC bus line, which serves as the primary path for power transmission. The green line denotes the load line, which controls the connection between loads and the bus. The dark-blue line indicates the generator line, which controls the connection between generators and the port/starboard buses. The orange solid line represents the zone tie line, which connects different zones to regulate power transfer.
For the directed-graph representation of the shipboard MVDC power system shown in Figure 2, an adjacency matrix $D = (x_{ij})_{n \times n}$ is used to describe the adjacency relationships between nodes, where $n$ denotes the total number of nodes. Each matrix entry takes a value of 0 or 1, indicating the following connection state between node $i$ and node $j$:
$$x_{ij} = \begin{cases} 1, & \text{if node } i \text{ is connected to node } j \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
In the shipboard MVDC power system, normally open standby lines (dashed) represent redundant branches in the network topology. In normal operation, these branches remain open. Therefore, the corresponding entries in the adjacency matrix are set to 0, reflecting the actual power transfer paths of the system.
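As a minimal sketch of this representation (the toy 4-node topology, node indices, and helper name are illustrative, not taken from the paper), the adjacency matrix can be built from the list of energized directed edges, with normally open standby lines simply omitted:

```python
# Minimal sketch: directed-graph adjacency matrix for a toy 4-node network.
# Node indices, the edge list, and build_adjacency are illustrative only.

def build_adjacency(n_nodes, energized_edges):
    """Return the n x n adjacency matrix D = (x_ij).

    Only energized lines contribute a 1; normally open standby
    lines are left at 0, matching the adjacency definition above.
    """
    D = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in energized_edges:
        D[i][j] = 1
    return D

# Toy example: generator (0) -> bus (1) -> two loads (2, 3);
# a standby tie line (3 -> 1) stays open and is therefore omitted.
D = build_adjacency(4, [(0, 1), (1, 2), (1, 3)])
```

Closing a standby line during reconfiguration then corresponds to setting the matching entry of the matrix to 1.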

2.2. Optimization Objectives and Operational Constraints for Fault Reconfiguration in Shipboard MVDC Power Systems

Compared with land-based distribution systems, shipboard MVDC power systems typically adopt a zonal architecture. Owing to space constraints and equipment configuration, power-supply redundancy is limited [32]. Moreover, faults tend to be localized and may propagate in a cascading manner [33]. Therefore, once a fault occurs, the system must isolate the faulted zone and reconfigure power paths within a very short time to maintain the operation of critical loads.

2.2.1. Objective Function

Based on the operational requirements of shipboard MVDC power systems, this paper formulates fault reconfiguration in shipboard MVDC power systems with the following three optimization objectives:
$$\max \sum_{i=1}^{N_{\mathrm{load}}} y_i \, p_i^{\mathrm{load}} \tag{2}$$
$$\min \sum_{i=1}^{N_{\mathrm{node}}} \left| v_i - 1 \right| \tag{3}$$
$$\min \frac{1}{N_{\mathrm{zone}}} \sum_{i=1}^{N_{\mathrm{zone}}} \left( u_i - \bar{u} \right)^2, \qquad u_i = \frac{p_i^{\mathrm{zone}}}{p_i^{g}} \tag{4}$$
Here, Equation (2) maximizes the total restored load power, where $y_i$ represents the supply status of load $i$ (1 indicates the load is served, 0 indicates it is not). Equation (3) minimizes the system voltage deviation. Equation (4) enhances load balancing by minimizing the variance of the zone utilization ratios, where $u_i$ is the power utilization rate (PUR) of zone $i$ and $\bar{u}$ is the average PUR across all zones. The core symbols in Equations (2)–(4) are summarized in Table A1; symbols not individually defined in subsequent formulas can also be found there.
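The three objective terms in Equations (2)–(4) can be sketched directly as small scoring functions (a minimal illustration; the sample values and zone layout in the test are made up):

```python
# Sketch of the three objective terms in Equations (2)-(4).

def restored_load(y, p_load):
    """Equation (2): total restored load power (to be maximized)."""
    return sum(yi * pi for yi, pi in zip(y, p_load))

def voltage_deviation(v):
    """Equation (3): total deviation from 1.0 p.u. (to be minimized)."""
    return sum(abs(vi - 1.0) for vi in v)

def pur_variance(p_zone, p_gen):
    """Equation (4): variance of the zone power utilization rates u_i."""
    u = [pz / pg for pz, pg in zip(p_zone, p_gen)]
    u_bar = sum(u) / len(u)
    return sum((ui - u_bar) ** 2 for ui in u) / len(u)
```

A reconfiguration candidate that raises `restored_load` while lowering `voltage_deviation` and `pur_variance` dominates on all three objectives.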

2.2.2. Operational Constraints

The constraints for fault reconfiguration in shipboard MVDC power systems include power flow balance equations, generator power output limits, bus capacity constraints, node voltage range constraints, and network topology constraints. Since the shipboard MVDC power system incorporates both DC and AC distribution grids, which differ in component types and constraint mechanisms, power flow constraints must be analyzed separately for the AC and DC grids [33]. The generator and load power limits are taken from Section 4, and the remaining bounds follow the literature benchmark settings.
DC grid power constraints can be expressed as follows [33]:
$$\sum i_{\mathrm{in},k} = \sum i_{\mathrm{out},k} + i_k^{\mathrm{load}} \tag{5}$$
$$v_k - v_l = i_{kl} R_{kl} \tag{6}$$
$$p_k^{\mathrm{load}} = p_k^{\mathrm{load,max}} x_k \tag{7}$$
$$\left| i_{kl} \right| \le i_{kl}^{\max} \tag{8}$$
$$v_k^{\min} \le v_k \le v_k^{\max} \tag{9}$$
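As a hedged sketch of how the DC grid power constraints above can be checked numerically (the per-node data layout and function names are illustrative, not from the paper):

```python
# Illustrative feasibility checks for the DC grid constraints above.

def dc_node_feasible(i_in, i_out, i_load, tol=1e-6):
    """KCL at a DC node: sum of incoming currents equals
    sum of outgoing currents plus the load current."""
    return abs(sum(i_in) - sum(i_out) - i_load) <= tol

def dc_line_ok(v_k, v_l, i_kl, R_kl, i_max, tol=1e-6):
    """Ohm's law on line k-l plus the line current limit."""
    return abs((v_k - v_l) - i_kl * R_kl) <= tol and abs(i_kl) <= i_max

def dc_voltage_ok(v_k, v_min=0.95, v_max=1.05):
    """Node voltage must stay inside its per-unit band."""
    return v_min <= v_k <= v_max
```

A switching action is only admissible if every affected node and line passes these checks in the resulting state.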
AC grid power constraints can be expressed as follows [33]:
$$p_i^{g} - p_i^{\mathrm{load}} = \sum_j v_i v_j \left[ G_{ij} \cos(\delta_i - \delta_j) + B_{ij} \sin(\delta_i - \delta_j) \right] \tag{10}$$
$$q_i^{g} - q_i^{\mathrm{load}} = \sum_j v_i v_j \left[ G_{ij} \sin(\delta_i - \delta_j) - B_{ij} \cos(\delta_i - \delta_j) \right] \tag{11}$$
$$p_i^{g,\min} \le p_i^{g} \le p_i^{g,\max} \tag{12}$$
$$q_i^{g,\min} \le q_i^{g} \le q_i^{g,\max} \tag{13}$$
$$\left| i_{ij} \right| \le i_{ij}^{\max} \tag{14}$$
$$v_i^{\min} \le v_i \le v_i^{\max} \tag{15}$$
where the subscripts $k, l \in \{1, 2, 3, \ldots, 38\}$ denote the busbar nodes on the port and starboard sides, as well as the busbar nodes where the loads are located, and the subscripts $i, j \in \{39, 40, 41, 42\}$ denote the busbar nodes where the generators are located.
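A minimal sketch of the active/reactive power balance in the AC power-flow equations above, written as a per-bus mismatch (the admittance values and operating point in the test are illustrative; a zero mismatch means the balance holds):

```python
# Sketch: active/reactive power mismatch at bus i for the AC balance above.
import math

def pq_mismatch(i, v, delta, G, B, p_g, p_load, q_g, q_load):
    """Return (dP, dQ) at bus i; both are zero when the balance holds."""
    p_inj = sum(v[i] * v[j] * (G[i][j] * math.cos(delta[i] - delta[j])
                               + B[i][j] * math.sin(delta[i] - delta[j]))
                for j in range(len(v)))
    q_inj = sum(v[i] * v[j] * (G[i][j] * math.sin(delta[i] - delta[j])
                               - B[i][j] * math.cos(delta[i] - delta[j]))
                for j in range(len(v)))
    return (p_g[i] - p_load[i]) - p_inj, (q_g[i] - q_load[i]) - q_inj
```

In a power-flow solver these mismatches are driven to zero; in constraint checking they quantify how far a candidate operating point is from balance.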
During fault reconfiguration, this paper makes the following assumptions regarding load power supply: a single-side power supply mode is adopted, meaning the load can only connect to either the normal power supply path or the backup power supply path. Additionally, it is stipulated that when the system’s power supply capacity is insufficient, load shedding operations may be executed. Based on these assumptions, the following constraints are established [27]:
$$z_t + z_u \le 1 \tag{16}$$
Here, $z_t$ denotes the power supply status of the normal path and $z_u$ that of the backup path (both take values of 0 or 1; 1 indicates the corresponding path is enabled, and 0 indicates it is disabled).
The constraint on the power balance between zonal power supply capacity and load demand is as follows:
$$K_k \, p_k^{g} \ge p_k^{\mathrm{zone}} \tag{17}$$
Here, $K_k$ denotes the power conversion factor of the AC/DC converter feeding zone $k$. This constraint limits the total zonal load so that it does not exceed the maximum supply capacity of the connected power source.
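The single-side supply rule and the zonal capacity constraint above can be sketched as two simple predicates (the numeric values in the test are illustrative only):

```python
# Sketch of the single-side supply rule and the zonal capacity check above.

def single_side_ok(z_t, z_u):
    """Normal and backup supply paths must not both be enabled."""
    return z_t + z_u <= 1

def zone_capacity_ok(K_k, p_g, p_zone):
    """Converted source capacity K_k * p_g must cover the zonal load."""
    return K_k * p_g >= p_zone
```

Both predicates must hold in every intermediate state of the switching sequence, not only in the final topology.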
By combining Equations (2)–(17), a multi-objective optimization model for fault reconfiguration in shipboard MVDC power systems can be established. Solving this model optimizes the system's power supply topology and power distribution after a fault, thereby achieving fault reconfiguration. The fault reconfiguration problem in shipboard MVDC power systems is essentially a dynamic optimization problem characterized by significant nonlinearity and non-convexity [3]. In practice, the number of feasible switching sequences grows combinatorially with system scale, while post-fault recovery must be completed within a tight decision window. To address these challenges, this paper transforms the multi-objective optimization model for fault reconfiguration into a collaborative multi-agent decision-making task and, within the Dec-POMDP framework, employs a MADRL method to solve it.

2.3. Dec-POMDP Modeling for Fault Reconfiguration in Shipboard MVDC Power Systems

When faults occur in a shipboard MVDC power system, limitations in communication reliability and control hierarchy make it difficult to rely on a single centralized controller to execute real-time decision-making [34]. Therefore, the fault reconfiguration process of a shipboard MVDC power system involves multi-module cooperative control. Following a fault occurrence, each control unit can only access local information and must perform decentralized cooperative decision-making under partially observable conditions, with system restoration effectiveness serving as the unified optimization objective. To this end, this paper models the fault reconfiguration process as a cooperative MADRL problem and employs the Dec-POMDP method for modeling. Dec-POMDP is a classical modeling method for describing and characterizing decentralized, partially observable cooperative decision-making problems [35]. Within this method, each agent selects actions based on local observations. The collective actions of all agents act upon the system, driving updates to the system state. The environment then provides feedback in the form of shared reward signals to guide iterative policy optimization. Dec-POMDP can be represented as the following six-tuple [36]:
$$G = \left\langle S, \{A_i\}_{i=1}^{N}, T, \{O_i\}_{i=1}^{N}, R, \gamma \right\rangle \tag{18}$$
where $S$ denotes the global state space of the environment; $A_i$ represents the action space of agent $i$; $N$ indicates the number of agents; $T$ is the transition function, $T(s_t, a_t, s_{t+1}) = P(s_{t+1} \mid s_t, a_t)$, giving the probability of transitioning from state $s_t$ to state $s_{t+1}$ at time step $t$ under the joint action $a_t = (a_t^1, \ldots, a_t^N)$ of all agents; $O_i$ is the observation function of agent $i$, $O_i(o^i \mid s_t, a^i)$, denoting the probability that the agent receives local observation $o^i$ given state $s_t$ and its own action; $R$ is the shared reward function $R(s_t, a_t)$, used to evaluate the quality of the joint action $a_t$ taken in state $s_t$; and $\gamma$ is the discount factor, used to balance immediate and future rewards.
The objective of Dec-POMDP is to obtain an optimal joint policy { π i * } i = 1 N that maximizes the expected cumulative reward. The expression for the expected cumulative reward is as follows [36]:
$$J(\theta) = \mathbb{E}\left[ \sum_t \gamma^t R\left( s_t, a_t^1, \ldots, a_t^N \right) \right] \tag{19}$$
where $J(\theta)$ denotes the expected cumulative discounted reward, serving as a measure of the agents' long-term expected total reward under parameter $\theta$, and $a_t^i = \pi^i(o_t^i)$ represents the action chosen by agent $i$ at time $t$ from its local observation $o_t^i$ through its policy $\pi^i$. To model the uncertainty of faults, this paper randomly selects a fault from a predefined set of fault types in the shipboard MVDC power system as the initial condition of each round, thereby generating different initial states and reconfiguration tasks. Within the Dec-POMDP framework, agents make decisions step-by-step based on local observations over a finite time horizon. To implement this reconfiguration process, this paper constructs a multi-agent system comprising $N = 5$ agents based on the aforementioned Dec-POMDP definition. The definitions and settings of each agent and its key components are provided below.
(1) Agent: This paper configures five agents, denoted Agent$_n$ ($n = 1, \ldots, 5$): Agent1, Agent2, and Agent3 represent the primary, secondary, and tertiary load agents, respectively. They control the interconnection switches on the busbar connection lines of each load level, enabling power supply switching for the corresponding loads. Agent4 serves as the zonal tie switch agent, responsible for operating the zonal tie switches. Agent5 functions as the generator agent, controlling the generator branch tie switches to switch generator supply paths. This decomposition follows common shipboard load management practice, where loads are classified by criticality to support selective shedding and restoration. Tie switch operations change network connectivity and thus determine the feasible reconfiguration topologies and power flow paths, while generator branch switching determines source-side connectivity and supply routing under capacity and operating constraints. Separating these decision types into dedicated agents decouples their constraints and action spaces, reducing conflicts among them.
(2) State: The state space should contain sufficient information to support effective learning and decision optimization for the intelligent agent. The state referred to in this paper denotes the global state of the system, representing the overall operational status of the shipboard MVDC power system during reconfiguration. Based on the definition of the state space S in Equation (18), this paper constructs the state space using node voltages, branch currents, line states, generator output power, load power, and fault line information, which can be expressed as follows:
$$s_t = \left[ v_t, \; i_t, \; z_t, \; p_t^{g}, \; p_t^{\mathrm{load}}, \; f_t \right] \tag{20}$$
where $v_t$, $i_t$, $z_t$, $p_t^{g}$, $p_t^{\mathrm{load}}$, and $f_t$ denote the node voltage magnitudes, branch currents, line statuses, generator output powers, load powers, and fault line information, respectively.
(3) Action: Agents control the on/off state of tie switches to alter the operational status of corresponding lines, thereby achieving power path reconfiguration and optimizing power flow adjustments within the grid. Each round corresponds to a complete fault reconfiguration and restoration process. At discrete decision moments, the system updates its state at time step t. Agents select and execute control actions based on current observations until system restoration is complete.
Agent1 is responsible for selecting the power supply side of the primary loads. As shown in Figure 1, the system contains $K = 6$ primary loads. For load $k$, the connection status of its two lines to the port/starboard buses is represented by the binary variables $z_{k,t}^{p}, z_{k,t}^{s} \in \{0, 1\}$, where 1 indicates on and 0 indicates off. The action is defined as follows:
$$a_t^{1} = (k, m_t), \qquad k \in \{1, \ldots, 6\}, \; m_t \in \{0, 1\} \tag{21}$$
where $m_t = 0$ ($m_t = 1$) indicates that the primary load takes the port (starboard) path, resulting in an action space of size $6 \times 2 = 12$. After this step, the line status of the selected load at the next time step is updated as follows:
$$\left( z_{k,t+1}^{p}, z_{k,t+1}^{s} \right) = \begin{cases} (1, 0), & m_t = 0 \\ (0, 1), & m_t = 1 \end{cases} \tag{22}$$
Agent2, Agent3, and Agent4 are responsible for controlling the on/off status of the secondary loads, tertiary loads, and zone tie switches, respectively. As shown in Figure 1, each of these categories contains $H = 6$ objects. At each time step $t$ in the round, each agent performs on/off control on only one of the six objects under its responsibility. For object $h$ (a load or zonal tie switch) controlled by Agent$_n$, a binary variable $z_{h,t}^{n} \in \{0, 1\}$ represents the on/off state of the corresponding line at time $t$, where 1 indicates on and 0 indicates off. The actions of these three agents are defined as follows:
$$a_t^{n} = (h, u_t^{n}), \qquad n \in \{2, 3, 4\}, \; h \in \{1, \ldots, 6\}, \; u_t^{n} \in \{0, 1\} \tag{23}$$
where $u_t^{n} = 1$ (or 0) indicates setting the corresponding line to on (or off). Thus, each agent's action space size is $6 \times 2 = 12$. After this step, the line state of the selected object at the next time step is given by the following equation:
$$z_{h,t+1}^{n} = u_t^{n} \tag{24}$$
At time step $t$, Agent5 decides the power supply direction of a single generator. The system contains $F = 4$ generators. Each generator $f$ has two lines connected to the port and starboard buses, respectively, whose connection status is represented by the binary variables $z_{f,t}^{p}, z_{f,t}^{s} \in \{0, 1\}$, where 1 indicates on and 0 indicates off. Only the combinations $\{(1,0), (0,1), (1,1)\}$ are permitted, corresponding to supplying the port side, the starboard side, or both sides simultaneously. The action is defined as follows:
$$a_t^{5} = (f, b_t), \qquad f \in \{1, \ldots, 4\}, \; b_t \in \{0, 1, 2\} \tag{25}$$
where $b_t = 0$, 1, and 2 correspond to port-side, starboard-side, and both-side power supply, respectively, resulting in an action space of size $4 \times 3 = 12$. After this action, the line status of this generator at the next time step is updated according to the following equation:
$$\left( z_{f,t+1}^{p}, z_{f,t+1}^{s} \right) = \begin{cases} (1, 0), & b_t = 0 \\ (0, 1), & b_t = 1 \\ (1, 1), & b_t = 2 \end{cases} \tag{26}$$
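The action encodings and line-state updates above can be sketched for Agent1 and Agent5 as follows (a minimal illustration; the list-based state containers and function names are assumptions, not the paper's implementation):

```python
# Sketch of action decoding and line-state updates for the agents above.
# The list-based state containers are illustrative only.

def step_agent1(z_p, z_s, k, m_t):
    """Route primary load k to the port (m_t=0) or starboard (m_t=1) path."""
    z_p[k], z_s[k] = (1, 0) if m_t == 0 else (0, 1)

def step_agent5(z_p, z_s, f, b_t):
    """Generator f supplies port (b_t=0), starboard (1), or both buses (2)."""
    z_p[f], z_s[f] = {0: (1, 0), 1: (0, 1), 2: (1, 1)}[b_t]

z_p, z_s = [0] * 6, [0] * 6   # line states of the six primary loads
step_agent1(z_p, z_s, 2, 1)   # load 2 takes the starboard path

g_p, g_s = [0] * 4, [0] * 4   # generator-to-bus line states
step_agent5(g_p, g_s, 0, 2)   # generator 0 feeds both buses
```

Agents 2-4 follow the same pattern, writing their chosen on/off value into the state entry of the selected object.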
(4) Observation: The observation space characterizes the local operational information accessible to various agents during system recovery, serving as the basis for agents to formulate control decisions. This paper constructs corresponding observation information sets for each agent. For Agent1 to Agent3, their observations include the real-time power of each of the six controlled loads, the states of the lines connecting the controlled loads to the port and starboard buses, and the voltage at the bus node where the load is located. Their observations can be expressed, respectively, as follows:
$$o_t^{1} = \left[ \left\{ z_{k,t}^{p}, z_{k,t}^{s}, p_{k,t}^{\mathrm{load},1}, v_{k,t}^{1} \right\}_{k \in K} \right] \tag{27}$$
$$o_t^{2} = \left[ \left\{ z_{h,t}^{2}, p_{h,t}^{\mathrm{load},2}, v_{h,t}^{2} \right\}_{h \in H} \right] \tag{28}$$
$$o_t^{3} = \left[ \left\{ z_{h,t}^{3}, p_{h,t}^{\mathrm{load},3}, v_{h,t}^{3} \right\}_{h \in H} \right] \tag{29}$$
Agent4’s observations include the states of the circuit lines controlled by its tie switches; Agent5’s observations include the power output of its controlled generator, the states of the lines connecting this generator to the port and starboard buses, and the voltage at the bus node where the generator is located. Their observations are represented as follows:
$$o_t^{4} = \left[ \left\{ z_{h,t}^{4} \right\}_{h \in H} \right] \tag{30}$$
$$o_t^{5} = \left[ \left\{ z_{f,t}^{p}, z_{f,t}^{s}, p_{f,t}^{g}, v_{f,t}^{5} \right\}_{f \in F} \right] \tag{31}$$
(5) Reward: Section 2.2 presents the optimization objectives and constraints for fault reconfiguration in shipboard MVDC power systems. Based on these objectives and constraints, the reward function design must guide agents to achieve optimization goals during fault reconfiguration while ensuring system operational safety.
During the interaction between agents and the environment, the environment updates its system state based on the agents’ actions and evaluates the effectiveness of their decisions using a pre-designed reward function. The reward function constructed in this paper comprehensively considers the optimization objectives given by Equations (2)–(4) and the operational constraints specified by Equations (5)–(17). It consists of three components: a maximum load power supply term, a voltage limit violation penalty term, and a load balancing penalty term. These, respectively, characterize the power supply capacity, voltage operational safety, and load distribution balance of the system.
(a)
Maximum Load Power Supply: At each time step, compute the power of all energized loads and form a weighted sum using the weighting factors $w_1$, $w_2$, and $w_3$ for primary, secondary, and tertiary loads, respectively. This yields the reward value for that time step. The mathematical expression is as follows:
$$r_t^{\mathrm{load}} = w_1 \sum_{i=1}^{n_1} y_{i,t}^{1} p_{i,t}^{\mathrm{load},1} + w_2 \sum_{j=1}^{n_2} y_{j,t}^{2} p_{j,t}^{\mathrm{load},2} + w_3 \sum_{k=1}^{n_3} y_{k,t}^{3} p_{k,t}^{\mathrm{load},3} \tag{32}$$
(b)
Voltage Limit Violation Penalty: To ensure the safety and stability of system operation, we introduce a node-voltage penalty term to penalize voltages outside the safe range. The voltage limit follows IEEE Std 1709–2018, which specifies a steady-state voltage tolerance of ±10% relative to the nominal value. For a conservative simulation setting, this paper uses a narrower range of 0.95–1.05 p.u. (±5%). The mathematical expression is as follows:
$$r_t^{\mathrm{node}} = -\lambda_v \sum_i \left[ \max\left( 0, v_{i,t} - 1.05 \right) + \max\left( 0, 0.95 - v_{i,t} \right) \right] \tag{33}$$
(c)
Load Balancing Penalty: During the reconfiguration of a shipboard MVDC power system, topology changes may partition the system into multiple zones. The previously defined PUR serves as the balancing metric for this purpose, with its definition provided in Equation (4). The PUR of zone k at time t is expressed as follows:
u_{k,t} = \frac{p_{k,t}^{\mathrm{zone}}}{p_{k,t}^{g}}
The closer the utilization rate of each zone is to 100%, the more balanced the resource allocation becomes. Therefore, the variance in utilization rates across zones is used to measure the degree of load balancing:
\sigma_t^2 = \frac{1}{N_{\mathrm{zone}}} \sum_{k=1}^{N_{\mathrm{zone}}} \left( u_{k,t} - \bar{u}_t \right)^2
Based on this definition, the load balancing reward can be expressed by the following equation:
r_t^{\mathrm{balance}} = -\lambda_b \sigma_t^2
Based on the above definitions, the total reward is written as a weighted sum of the three terms.
r_t = \alpha_L \tilde{r}_t^{\mathrm{load}} + \alpha_V \tilde{r}_t^{\mathrm{node}} + \alpha_B \tilde{r}_t^{\mathrm{balance}}
The normalized terms are given by:
\tilde{r}_t^{\mathrm{load}} = \frac{r_t^{\mathrm{load}}}{r_{\mathrm{ref}}^{\mathrm{load}}}, \quad \tilde{r}_t^{\mathrm{node}} = \frac{r_t^{\mathrm{node}}}{r_{\mathrm{ref}}^{\mathrm{node}}}, \quad \tilde{r}_t^{\mathrm{balance}} = \frac{r_t^{\mathrm{balance}}}{r_{\mathrm{ref}}^{\mathrm{balance}}}
Here, r_ref^load is set to the restoration reward of a representative secondary load to capture the typical scale of load restoration. The voltage reference r_ref^node is set to the penalty magnitude when the voltage-violation level reaches the minimum level of concern Δv_min = 0.05 p.u. The reference imbalance level σ_ref^2 is obtained from simulation statistics, and σ_ref^2 ≈ 0.03 is used. Accordingly, r_ref^balance is set to the penalty magnitude at σ_t^2 = σ_ref^2. Based on these normalized terms, the weights are chosen as α_L : α_V : α_B = 10 : 4 : 2, a ratio selected to impose a clear priority among the three objectives. Load restoration is treated as the main performance target during reconfiguration. Voltage security is a strict operational requirement in shipboard MVDC systems, since voltage limit violations may trigger cascading effects or severe failures; the voltage term is therefore assigned a substantial weight to strongly discourage violations. The balancing term is kept smaller so that it mainly guides resource allocation among feasible actions without dominating the reward.
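As a concrete illustration, the three reward components and their weighted combination can be sketched as follows. This is a minimal sketch, not the experimental implementation: the priority weights, penalty coefficients, and reference magnitudes below are placeholder values standing in for the paper's tuned settings.

```python
import statistics

def reward_terms(served, v_nodes, zone_used, zone_gen,
                 w=(1.0, 0.6, 0.3), lam_v=1.0, lam_b=1.0):
    """One-step reward terms (placeholder weights w and penalties lam_*).

    served    : list of (priority 0/1/2, online flag 0/1, power) per load
    v_nodes   : per-unit voltages of the energized nodes
    zone_used / zone_gen : served power and available generation per zone
    """
    # (a) priority-weighted served-load term
    r_load = sum(w[p] * y * pw for p, y, pw in served)
    # (b) voltage-violation penalty for the 0.95-1.05 p.u. band
    r_node = -lam_v * sum(max(0.0, v - 1.05) + max(0.0, 0.95 - v)
                          for v in v_nodes)
    # (c) load-balancing penalty: population variance of zone PURs
    u = [pu / pg for pu, pg in zip(zone_used, zone_gen)]
    r_bal = -lam_b * statistics.pvariance(u)
    return r_load, r_node, r_bal

def total_reward(terms, refs=(1.0, 0.05, 0.03), alphas=(10.0, 4.0, 2.0)):
    """Normalize each term by its reference magnitude, then apply 10:4:2."""
    return sum(a * t / r for a, t, r in zip(alphas, terms, refs))
```

Each zone's utilization rate enters only through the variance, so the balancing term pushes the zones toward equal PUR without rewarding any particular absolute level.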
Table 1 summarizes the key elements of the Dec-POMDP model, including the agents, their actions, local observations, and the shared reward.
Based on the aforementioned Dec-POMDP definition and agent configuration, Figure 3 illustrates the interaction process between the multiple agents and the shipboard MVDC power system. At time step t, each agent obtains local observations o_t according to Equations (27)–(31) and selects action a_t via policy π. After executing the actions, the environment updates its state according to Equation (20), returns new observations, and generates the instantaneous reward r_t using Equation (37). This paper adopts maximizing the cumulative reward as the optimization objective; the cumulative reward is obtained by weighting and summing the rewards of each time step according to Equation (19), thereby driving the iterative update of the policy.

3. Solution to Dec-POMDP Model for Fault Reconfiguration of Marine MVDC Power System Based on Dep-QMIX-Mask

Section 2 provides a detailed discussion on Dec-POMDP modeling for fault reconfiguration in shipboard MVDC power systems, designing the actions, states, and reward functions of agents in partially observable environments. Since each agent makes decisions independently based solely on local information, achieving effective collaboration under decentralized decision-making is the key challenge in solving this model. Classic methods include Value-Decomposition Networks (VDN) [37], Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [38], and QMIX with monotonic value function decomposition. However, in strongly coupled systems, the assumption of independent decision-making often fails to accurately capture the dependencies between agents. At the same time, ineffective actions, lacking proper constraints, can easily disrupt the strategy learning process, thereby limiting the system’s collaborative efficiency and overall performance.

3.1. Fundamental Principles of the QMIX Algorithm

The QMIX algorithm, as a MADRL method, employs the centralized training and decentralized execution (CTDE) paradigm to solve cooperative problems [39]. Its core idea is to combine the local action-value functions Q_i of each agent into a global action-value function through a Mixing Network (MN). This enables value evaluation using global information during the training phase, while agents only need to select actions based on their own local observations during the execution phase [40]. The network structure of QMIX is shown in Figure 4, where Q_i represents the value function of agent i, Q_tot denotes the global action-value function obtained by mixing the agents' local values, a_t^i represents the action of agent i, and s denotes the global state.
In MN, each agent solves for actions that maximize its own local reward based on current interaction trajectories; the optimal action combinations of all agents collectively form the globally optimal actions, thereby achieving maximum total reward. The synergistic attainment of local and global optimality can be realized through the following formula [40].
\arg\max_{\mathbf{a}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{a}) = \begin{pmatrix} \arg\max_{a^1} Q_1(\tau^1, a^1) \\ \vdots \\ \arg\max_{a^N} Q_N(\tau^N, a^N) \end{pmatrix}
To satisfy Equation (39), QMIX imposes a monotonicity constraint on the Q value [40]:
\frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \ge 0, \quad \forall i \in \{1, 2, \ldots, N\}
In QMIX, by imposing non-negative constraints on the connection parameters generated by the hypernetwork in MN, the monotonicity of the joint Q-values with respect to individual Q-values is ensured, thereby satisfying the conditions specified in Equations (39) and (40).
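A minimal sketch of this non-negativity trick follows. In QMIX the mixing weights are produced by hypernetworks conditioned on the global state s; here they are passed in directly for brevity, and applying abs() to them (together with a monotone activation) is what enforces the monotonicity constraint above.

```python
def monotonic_mix(q_locals, hidden_weights, out_weights, bias=0.0):
    """Toy QMIX-style mixer. Taking abs() of the (hypernetwork-generated)
    weights makes them non-negative, so Q_tot is monotone in every Q_i."""
    w1 = [[abs(x) for x in row] for row in hidden_weights]  # non-negative layer-1 weights
    # ReLU is monotone non-decreasing, so it preserves the monotonicity
    hidden = [max(sum(wi * q for wi, q in zip(row, q_locals)), 0.0)
              for row in w1]
    w2 = [abs(x) for x in out_weights]                      # non-negative layer-2 weights
    return sum(vi * h for vi, h in zip(w2, hidden)) + bias
```

Because every weight is non-negative and the activation is monotone, raising any single agent's local Q-value can never lower Q_tot, which is exactly the condition that makes the per-agent argmax consistent with the joint argmax.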

3.2. Description of Agent Decision-Making Dependency Relationships Based on the Deep Dependency Mechanism

Although the QMIX algorithm provides a useful decentralized multi-agent learning framework, some limitations remain in the assumptions during execution. During execution, agents typically act independently using only their local observations and per-agent Q-functions. The framework assumes that agents make decisions independently without sharing global information. However, in many cooperative tasks, especially when agents exhibit strong mutual dependencies, this assumption of independent decision-making often limits the system’s cooperation efficiency and overall performance.
Taking fault reconfiguration in shipboard MVDC power systems as an example, the system is highly coupled, meaning that one agent’s action, such as switching operations or generator synchronization, may influence other agents’ decisions. If agents can indirectly understand the operational logic of other agents, they can consider global constraints, such as load balancing, in their local decision-making. This enables more efficient coordination of operational sequences and optimizes the overall performance of system reconfiguration. To enable agents to rely only on their independent observations during execution while still achieving the effect of considering the action logic of other agents, effective strategies or mechanisms must be designed to indirectly reflect the influence of other agents.
In recent years, QMIX has been extended along several directions. Some studies reweight the training objective to mitigate learning bias and improve expressiveness [41], while others introduce residual-style correction terms to refine the joint value estimation [42]. Work on improving robustness strengthens performance under perturbed observations via adversarially motivated training strategies [43]. Beyond these, action-branching variants address large action spaces with branched decision structures and temporal modeling [44], sample-efficiency improvements incorporate multi-step forward dynamics modeling and attention mechanisms [45], and double-hypernetwork designs mitigate overestimation by architectural modifications [46]. Overall, these efforts mainly enhance QMIX through the objective, architecture, robustness training, or action-space handling, whereas our work focuses on constraint-induced inter-agent dependency during execution and improves coordination and feasibility under decentralized deployment.
To address the limitations of the traditional QMIX method, this paper first incorporates a dependency correction mechanism into the conventional QMIX algorithm to reflect the influence of other agents’ actions on an agent’s decision-making. After introducing this dependency correction mechanism, QMIX adopts a dual-stream parallel training framework. This framework trains the policy network of each agent in parallel using both dependency-based and independent-based modes. In dependency mode, the agent’s Q-value calculation is no longer limited to its own local observations but also incorporates action information from other agents. This enables the agent to fully understand the behavioral logic of other agents during training. To meet decentralized execution requirements, QMIX, with its dependency correction mechanism, can derive local policies for independency mode from strategies trained in dependency mode. During actual execution, this improved algorithm switches to independency mode. At this stage, each agent makes decisions only based on its own observations and pre-trained local policies. This ensures agents can operate autonomously in decentralized environments while avoiding reliance on global information. Although the dependency and independency modes differ in data flow and decision logic, this improved algorithm ensures that both aim to maximize global utility during training through the sharing of a single hybrid network. The principle of the QMIX algorithm with the introduced dependency correction mechanism is illustrated in Figure 5. The definitions of each internal network are as follows:
First, define an independent local action-value function Q_i^ind(a_i | s; ψ_i) for each agent i, and augment it with a dependency correction term Δ_i^dep(a_i | s, a_{<i}; ω_i), where a_{<i} = {a_1, a_2, …, a_{i−1}} denotes the joint action of all agents whose indices are smaller than i, characterizing the influence of preceding agents' actions on agent i. Together, these two components form a dependency-based local action-value function Q_i^dep(a_i | s, a_{<i}), which can be expressed as:
Q_i^{\mathrm{dep}}(a_i \mid s, a_{<i}) = Q_i^{\mathrm{ind}}(a_i \mid s; \psi_i) + \Delta_i^{\mathrm{dep}}(a_i \mid s, a_{<i}; \omega_i)
The QMIX algorithm, incorporating the dependency correction mechanism, retains the MN structure from the traditional QMIX algorithm and employs the same Mixer to compute both the global dependent Q-value and the independent Q-value. This aligns with the consistency condition shown in Equation (39), with the calculation formula being:
Q_{\mathrm{tot}}^{\mathrm{dep}}(s, \mathbf{a}) = \mathrm{Mixer}\left( Q_1^{\mathrm{dep}}(a_1 \mid s, a_{<1}),\; Q_2^{\mathrm{dep}}(a_2 \mid s, a_{<2}),\; \ldots,\; Q_N^{\mathrm{dep}}(a_N \mid s, a_{<N});\; w, b \right)
Q_{\mathrm{tot}}^{\mathrm{ind}}(s, \mathbf{a}) = \mathrm{Mixer}\left( Q_1^{\mathrm{ind}}(a_1 \mid s; \psi_1),\; Q_2^{\mathrm{ind}}(a_2 \mid s; \psi_2),\; \ldots,\; Q_N^{\mathrm{ind}}(a_N \mid s; \psi_N);\; w, b \right)
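The two value streams can be illustrated with a toy example. The summation mixer below is only a stand-in for the shared monotonic mixing network, and the numeric values are arbitrary; the point is that both streams pass through the same Mixer and differ only in their local inputs.

```python
def q_dep(q_ind, delta):
    """Dependency-corrected local value: Q_i^dep = Q_i^ind + Delta_i^dep."""
    return q_ind + delta

def mix(qs):
    """Stand-in for the shared monotonic mixing network (same Mixer for both streams)."""
    return sum(qs)

q_ind = [1.0, 0.5]    # values from local observations only (independency mode)
deltas = [0.0, 0.3]   # corrections conditioned on preceding agents' actions a_<i

q_tot_ind = mix(q_ind)
q_tot_dep = mix([q_dep(q, d) for q, d in zip(q_ind, deltas)])
```

Agent 1 has no predecessors, so its correction is zero; only later agents receive a non-trivial dependency term.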
To ensure the optimality of the joint strategy, as in Equation (39), both the weights and biases of the mixing network are dynamically generated by the hypernetwork using the global state s as input. A linear network with an absolute-value activation function is used to enforce non-negativity of the weights, thereby satisfying the monotonicity constraint.
Under the dependency mode, a random number u is sampled from the uniform distribution on the interval (0, 1), denoted u ∼ U(0, 1). Each agent determines its action according to the ε-greedy rule, interacts with the environment, and stores the collected experience in the experience replay buffer D. The action selection rule is expressed as follows:
a_i = \begin{cases} \arg\max_{a \in \mathcal{A}_i^{\mathrm{avail}}} Q_i(\tau_i, a), & \text{if } u \ge \varepsilon, \\ \mathrm{Uniform}\left( \mathcal{A}_i^{\mathrm{avail}} \right), & \text{otherwise.} \end{cases}
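A minimal sketch of this ε-greedy rule over the available action set, where the uniform draw u decides between exploitation and exploration:

```python
import random

def epsilon_greedy(q_values, avail, eps, rng=random.Random(0)):
    """Draw u ~ U(0,1); exploit (greedy argmax over the available set)
    if u >= eps, otherwise explore uniformly over the available set."""
    u = rng.random()
    if u >= eps:
        return max(avail, key=lambda a: q_values[a])  # exploit
    return rng.choice(avail)                          # explore
```

Restricting both branches to the available set already prevents a random exploration step from picking an action outside the feasible set.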
In independency mode, the agent performs offline learning by sampling from the experience replay buffer. This paper iteratively optimizes the parameters of Q i ind , Q i dep , and Mixer using Temporal Difference (TD) error, each time drawing samples from the same batch in the replay buffer. TD targets are constructed for both the independent and dependent branches:
y^{\mathrm{ind}} = r + \gamma \max_{\mathbf{a}'} Q_{\mathrm{tot}}^{\mathrm{ind}}\left( s', \mathbf{a}'; \theta' \right)
y^{\mathrm{dep}} = r + \gamma \max_{\mathbf{a}'} Q_{\mathrm{tot}}^{\mathrm{dep}}\left( s', \mathbf{a}'; \theta' \right)
Next, the batch squared loss for both the independency and dependency modes is minimized, respectively:
L^{\mathrm{ind}}\left( \{\psi_i\}_{i=1}^{N}, \theta \right) = \sum_{b=1}^{B} \left( y_b^{\mathrm{ind}} - Q_{\mathrm{tot}}^{\mathrm{ind}}(s_b, \mathbf{a}_b; \theta) \right)^2
L^{\mathrm{dep}}\left( \{\omega_i\}_{i=1}^{N}, \theta \right) = \sum_{b=1}^{B} \left( y_b^{\mathrm{dep}} - Q_{\mathrm{tot}}^{\mathrm{dep}}(s_b, \mathbf{a}_b; \theta) \right)^2
where B is the sampled batch size, and θ and θ′ denote the parameters of the online and target networks, respectively. The gradients of the two value streams are kept separate to prevent mutual interference. Both streams share the same mixing network and are updated in turn within each iteration (first the dependent branch, then the independent branch). The independent branch updates only the parameters ψ_i of Q_i^ind; the dependent branch updates only the parameters ω_i of the correction term Δ_i^dep. The shared MN is updated in both steps. This allows the dependent and independent branches to focus on minimizing their respective TD errors. To improve training stability, a soft update is employed, periodically blending parameters from the behavior network into the target network [36]:
\theta' \leftarrow \xi \theta + (1 - \xi) \theta'
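The soft update can be sketched parameter-wise; the plain dictionaries below stand in for network parameter tensors.

```python
def soft_update(target, online, xi=0.001):
    """Soft target update theta' <- xi*theta + (1 - xi)*theta',
    applied independently to each named parameter."""
    for name in target:
        target[name] = xi * online[name] + (1.0 - xi) * target[name]
    return target
```

With ξ = 0.001 (the coefficient used in Section 4.2), the target network trails the online network slowly, which keeps the TD targets from drifting with every gradient step.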

3.3. Mask-Based Invalid Action Masking for Agents

In the QMIX algorithm incorporating the dependency correction mechanism, the integration of dependency and independency modes effectively improves multi-agent collaboration. However, in highly constrained tasks such as fault reconfiguration of shipboard MVDC power systems, action failure remains a significant issue. Action failures arise mainly in two cases. First, owing to complex system constraints, some switching operations are physically infeasible or do not serve the fault isolation requirements; these are inherently invalid actions. Second, under step-by-step execution, some actions become redundant once preceding steps have been executed: for example, if a switch has already been switched off at an earlier time, switching it off again in a later step does not change the system topology but wastes decision resources.
Reward and punishment mechanisms are commonly employed to constrain the selection of invalid actions. However, relying solely on such mechanisms often leads to a trade-off: overly stringent penalties may cause agents to fall into under-exploration and become strategically rigid, while overly lenient ones fail to effectively suppress constraint violations, making it difficult to guarantee the validity of actions. Even if an agent ultimately learns a strategy that satisfies the constraints, it cannot be guaranteed that the constraints will be strictly met in all states and at all times. To address this issue, this paper further introduces action masking technology into the QMIX algorithm with a dependency correction mechanism. By explicitly constraining the set of available actions, it suppresses the selection of invalid actions. The definition is as follows [29]:
\mathrm{mask}\,Q(s_i, a_i) = \begin{cases} Q(s_i, a_i), & a_i \in \mathcal{A}_i^{\mathrm{avail}}, \\ M, & \text{otherwise} \end{cases}
At each time step, the technique constructs the set of valid actions A_i^avail for each agent based on the current state and assigns an extremely small constant M (e.g., −1 × 10^10) to the values of invalid actions, thereby masking them during the decision-making phase. Since subsequent action selection employs a greedy strategy, agents prioritize decision-making within the set of valid actions, preventing infeasible or redundant operations. The implementation process of the action mask is shown in Figure 6: after the Q-network of each agent outputs the values for all candidate actions, an element-wise operation with a mask matrix containing the constant M suppresses the values corresponding to invalid actions, thereby restricting the decision-making process to the set of feasible actions.
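A minimal sketch of the masking step: invalid actions receive the constant M, so the subsequent greedy selection is confined to the available set.

```python
M = -1e10  # large negative constant assigned to invalid actions

def mask_q(q_values, avail):
    """Overwrite the Q-values of unavailable actions with M so that a
    greedy argmax can never select them."""
    return [q if a in avail else M for a, q in enumerate(q_values)]

def greedy_masked(q_values, avail):
    """Greedy selection over the masked Q-values."""
    masked = mask_q(q_values, avail)
    return max(range(len(masked)), key=lambda a: masked[a])
```

Unlike a penalty term, the mask is a hard constraint: an invalid action is excluded from the argmax in every state and at every step, regardless of how the learned values evolve.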
In dependency mode, the action mask prevents invalid actions from participating in value evaluation and policy updates by explicitly blocking ineligible actions. During execution, the masking mechanism integrates with Dep-QMIX’s decentralized local decision-making, restricting action selection to the set of eligible actions. This ensures each agent can satisfy system operational constraints while relying on local information.
By integrating the action masking mechanism and the dependency correction mechanism with the basic QMIX algorithm, this paper proposes the Dep-QMIX-Mask algorithm, whose training process is illustrated in Figure 7.
The online computational cost of the proposed method can be decomposed into two components per decision step: (i) one forward pass of the multi-agent networks to evaluate a finite discrete action set with action masking applied to suppress infeasible actions, and (ii) one environment update that includes a power-flow/feasibility check. Therefore, the order-of-magnitude online runtime can be summarized as:
T_{\mathrm{online}} = O\left( H \left( C_{\mathrm{net}} + C_{\mathrm{pf}} \right) \right)
where H is the decision horizon, C net denotes the total forward-inference cost of all agent networks (including masking), and C pf is the cost of one power-flow/feasibility check (often dominant in practice). Training is performed offline and is mainly driven by repeated simulator interactions and gradient-based updates, and action masking reduces wasted exploration on invalid actions and improves sample efficiency.

4. Case Verification and Results Analysis of Fault Reconfiguration in Shipboard MVDC Power Systems

4.1. Parameter Settings of the Verification Experiment

This section validates the feasibility and superiority of the proposed method through case studies. First, a MADRL simulation environment for shipboard MVDC power systems is built in Python, and the shipboard MVDC power system test model is validated based on this environment. The experimental environment configuration is as follows: Python 3.11, Pandapower 2.14.1, and PyTorch 2.2.2.
The system connects 12 loads, categorized into three priority levels: Primary, Secondary, and Tertiary. Weighting coefficients decrease with priority, with the highest for the Primary load. The system is also equipped with 38 tie switches and 4 distributed generators. The main distribution network’s rated voltage is set at 3 kV, and the power conversion factor K is set to 1. During initial operation, the system is divided into multiple electrical zones. Under normal operating conditions, loads within each zone are powered by their dedicated main DC bus. Additionally, the generator can be configured to supply power exclusively to either the port or starboard side, or to both sides concurrently, based on actual power requirements. Generator and load parameters are detailed in Table 2 and Table 3.

4.2. Hyperparameter Settings

In this section’s experiments, the hyperparameter settings for MADRL are shown in Table 4. The hidden layer size and the mixing dimension are set to 64 to balance representation ability and computational cost. The batch size is set to 32 to reduce the variance of gradient estimates and keep efficient updates. Exploration uses an ε-greedy policy, and ε decays from 1 to 0.05 to ensure sufficient exploration in early training and lower exploration noise in later training. The target network uses a soft update with a coefficient of 0.001 to reduce TD target drift and improve training stability. The initial learning rate is set to 0.008 and is decayed to 0.0005 using cosine annealing to reduce update magnitude near convergence and suppress late-stage oscillations. The maximum steps per episode are set to 6 to match the task horizon.
The discount factor γ is tested with γ = 0.9, γ = 0.95, and γ = 0.99, while the other settings remain unchanged. Because each episode is short, the last-step weight is γ^5, which is about 0.59, 0.77, and 0.95 for the three values; γ therefore directly controls how strongly delayed outcomes affect the target value. The reward includes a supply benefit term and constraint penalty terms, the latter coming mainly from voltage violations and zone utilization constraints. These penalties reflect the delayed effect of early actions: some actions increase the supply benefit in early steps but cause larger violation or imbalance penalties in later steps. A smaller γ reduces the weight of these later penalties; it can produce a faster initial rise, but as the delayed penalties accumulate, the return can drop and oscillate, so the curve shows larger variance. A larger γ increases sensitivity to later penalties and supports updates that consider the full episode, so the curve is usually smoother. In Figure 8, γ = 0.99 is the most stable, γ = 0.90 fluctuates the most, and γ = 0.95 lies in between and is often close to γ = 0.99. Based on stability and final return, γ = 0.99 is used as the default setting.
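The quoted last-step weights follow directly from γ^(H−1) with horizon H = 6 and can be checked in one line:

```python
# Discount weight of the final (6th) step: gamma^(H-1) with H = 6
weights = {g: g ** 5 for g in (0.90, 0.95, 0.99)}
# approximately 0.59, 0.77, and 0.95 for the three candidate values
```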
To maintain smooth network updates and enhance convergence stability during the later training stages, a Cosine Annealing method is applied to the initial learning rate. This method gradually and smoothly decreases the learning rate to its minimum value as the number of training epochs increases. The update formula is as follows [47]:
l_t = l_{\min} + \frac{1}{2} \left( l_{\max} - l_{\min} \right) \left( 1 + \cos\left( \frac{\pi t}{T_{\max}} \right) \right)
where l_t is the learning rate at training iteration t, l_min and l_max denote the minimum and initial learning rates, respectively, and T_max is the maximum number of training iterations. This formula uses a cosine function to smoothly decay the learning rate.
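A direct implementation of this schedule; the defaults below use the initial and minimum learning rates stated in Section 4.2 (0.008 and 0.0005).

```python
import math

def cosine_annealed_lr(t, t_max, l_min=0.0005, l_max=0.008):
    """Cosine annealing: decays smoothly from l_max at t = 0 down to
    l_min at t = t_max, following the formula above."""
    return l_min + 0.5 * (l_max - l_min) * (1.0 + math.cos(math.pi * t / t_max))
```

At the midpoint t = T_max/2 the schedule sits exactly halfway between l_max and l_min, and the slope vanishes at both endpoints, which is what keeps late-stage updates small.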

4.3. Comparative Verification of Training Performance for the Dep-QMIX-Mask Algorithm

To verify the effectiveness of the Dep-QMIX-Mask algorithm, this section benchmarks it against the baseline QMIX, QMIX combined with a masking mechanism (denoted QMIX-Mask), and QMIX coupled with a deep dependency mechanism (denoted Dep-QMIX), illustrating the training performance of the Dep-QMIX-Mask approach.
This section adopts Cumulative Reward (CR) as the primary metric for assessing the agents’ learning performance, quantifying the total reward obtained by the agent within a single training round. Its equation is as follows:
\mathrm{CR} = \sum_{t=1}^{T} \gamma^{t-1} r_t
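This cumulative discounted reward can be computed directly from a per-step reward sequence:

```python
def cumulative_reward(rewards, gamma=0.99):
    """CR = sum over t of gamma^(t-1) * r_t, with rewards[0] = r_1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```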
The training phase includes multiple random faults, including the outage of 0–2 generators, the disconnection of 1–3 lines, and mixed faults of both types. At the start of each round, a fault scenario is randomly selected from the predefined fault set as the initial fault condition for that round. Training samples are generated through agent-environment interactions and stored in an experience replay buffer. During training, batches are randomly sampled from the replay buffer to update the network parameters. At regular training intervals, the current policy is evaluated to quantify the agents' learning performance and convergence across different fault scenarios.
Figure 9 shows the CR curves during the training of each algorithm. The solid line represents the moving average of the CR per training step from independent runs with different random seeds, while the light shaded area indicates one standard deviation.
As shown in Figure 9, under the ε-greedy strategy, Dep-QMIX-Mask takes random actions with a high probability during the early training phase, thereby resulting in some invalid actions. As training progresses, the agent’s policy improves, with the exploration rate decreasing and the exploitation ratio increasing. After approximately 2000 training episodes, the overall training process converges. Due to multiple random faults during training, the best rewards vary across different fault conditions. Additionally, influenced by environment variations and the minimum exploration rate εmin, the final training curve still exhibits some fluctuation and has not yet achieved fully stable convergence.
In contrast, the MN of the base QMIX algorithm relies solely on the global state and each agent's local utility function for value mixing, without explicitly modeling inter-agent action dependencies. Consequently, its CR curve converges to a lower value. Moreover, when the joint action space is large, algorithms without action masking are more susceptible to invalid-action exploration, leading to a lower asymptotic value. After introducing action masking, the CR curve reaches a higher value. This occurs because the action masking mechanism restricts selection to the set of valid actions, reducing invalid exploration and allowing the learning process to focus on viable strategies, thereby improving exploration efficiency.
To verify whether the proposed algorithm can achieve performance comparable to the dependency mode under decentralized execution, this paper calculates and logs the CR values for both independency and dependency modes during training. Averaged over multiple random seeds, learning curves are plotted as shown in Figure 10. As shown in Figure 10, the CR curves of both modes show similar patterns: rapid initial growth, a transition phase, and eventual stabilization. By the late training stage, the two curves diverge minimally, displaying only slight fluctuations. These results show that the independency mode achieves returns close to those of the dependency mode even when using only local observations, indicating that it retains the coordination gains of the dependency mode.

4.4. Comparative Validation of Fault Reconfiguration Performance in Shipboard MVDC Power System

To further evaluate the effectiveness of the Dep-QMIX-Mask method in addressing the Dec-POMDP setting for fault reconfiguration in shipboard MVDC power systems, this section considers three test cases with different fault types and severities, drawn from a predefined set of random faults, for comparative evaluation of fault reconfiguration. The proposed method is compared with the baseline QMIX algorithm and two heuristic algorithms: Gray Wolf Optimization (GWO) and the Whale Optimization Algorithm (WOA). Each algorithm is run independently (20 independent trials per case), and an evaluation is conducted based on metrics including the served load rate (SLR), load balancing, completion time, and number of feasible solutions. Among these, SLR is defined as the ratio of the total served load power to the total system load power:
\mathrm{SLR} = \frac{\sum_{i=1}^{N_{\mathrm{load}}} y_i P_i^{\mathrm{load}}}{\sum_{i=1}^{N_{\mathrm{load}}} P_i^{\mathrm{load}}}
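A direct implementation of the SLR metric, with binary supply flags y_i and load powers P_i^load:

```python
def served_load_rate(online_flags, load_powers):
    """SLR: ratio of served load power to total system load power."""
    served = sum(y * p for y, p in zip(online_flags, load_powers))
    return served / sum(load_powers)
```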

4.4.1. Case 1: Post-Fault Reconfiguration in a Shipboard MVDC Power System Under G2 Outage and Line Disconnections x37 and x17

In the initial state, all zone tie switches are open, with each generator feeding both busbars. After a fault occurs, the Dep-QMIX-Mask algorithm takes a sequence of actions to restore power and reconfigure the network topology. Line x3 transitions from off to on, restoring power to secondary load L15 and primary load L3. Generator G3 supplies power exclusively to the starboard side, ensuring a stable power supply to the starboard area. After line x12 is connected, power to starboard loads is transferred from G1 to G3. Lines x16 and x7 become connected, restoring power to secondary load L11; primary load L4 is supplied by port-side generator G4. To meet overall load balancing requirements, primary loads L2 and L5 are supplied by port-side generators. Ultimately, the system is divided into three mutually independent zones. Figure 11 illustrates the system topology changes in Case 1, showing the shipboard’s topology after the failure and the reconfigured system topology.
The node voltage per unit (p.u.) values after reconfiguration are shown in Figure 12. To clearly distinguish de-energized and energized nodes, this paper assigns the voltages of de-energized nodes to 0.95 p.u. for plotting. As seen in the figure, the voltages of energized nodes remain within the acceptable range of 0.95 to 1.05 p.u., indicating that the voltage constraints are satisfied after network reconfiguration.
Figure 13 displays the SLR for each load class. The data show that the proposed Dep-QMIX-Mask algorithm not only restores power to previously de-energized loads but also prioritizes high-priority loads during recovery. For Case 1, Figure 14 reports the per-step inference time during online decision-making; since the online inference procedure and network architecture are identical across the three cases, Case 1 is presented as a representative example. The inference latency is stable across steps, ranging from 0.018 s to 0.024 s per step. Therefore, a complete six-step reconfiguration episode takes about 0.12 s in total, obtained by summing the inference times from step 1 to step 6, which supports real-time deployment.
As summarized in Table 5, RL-based methods (Dep-QMIX-Mask and the baseline QMIX) exhibit a clear efficiency advantage over traditional optimizers such as WOA and GWO, achieving 7.17× and 5.25× speedups, respectively. This advantage arises because WOA and GWO rely on repeated iterative searches for switching sequences and may converge prematurely to local optima, which increases computation time and reduces the proportion of feasible solutions. In contrast, Dep-QMIX-Mask and QMIX directly output switching sequences with low inference latency, enabling faster post-fault restoration.
Table 5 summarizes the Optimal Reconfiguration Results (ORR) for each algorithm; each binary sequence corresponds to the power supply status of 18 loads: “1” indicates normal power supply to the load with the corresponding index number, whereas “0” indicates no power supply. Quantitatively, Dep-QMIX-Mask and QMIX both achieve an SLR of 0.88 with full feasibility (NFS/20 = 20/20), while WOA and GWO yield 18/20 and 16/20 feasible solutions, respectively.
In Figure 15, the grouped bar chart employs three colors to denote the three reconfigured zones; the three bars for each algorithm give its PUR in the corresponding zone, and the line chart on the right axis shows the variance of utilization rates across zones for each algorithm. Under both the Dep-QMIX-Mask and QMIX algorithms, the overall PUR is 0.88. Although the best reconfigurations achieve the same overall PUR, differing load distributions lead to disparities in load balancing. The proposed algorithm achieves PURs of 0.76, 0.89, and 0.73 within the three zones, with a variance of approximately 0.0048; QMIX achieves 0.55, 0.89, and 0.93 with a variance of approximately 0.0291, indicating better load balancing for the proposed algorithm. While WOA can reach the same overall PUR as the proposed method, it shows inferior load balancing and slower reconfiguration. GWO achieves a lower overall PUR (0.83) and exhibits the poorest overall performance. Overall, Dep-QMIX-Mask achieves an overall PUR of 0.88, matching QMIX and WOA and exceeding GWO at 0.83, while delivering the best load balancing with the lowest PUR variance of 0.0048: an 83.5% reduction relative to QMIX and WOA at 0.0291 and an 85.2% reduction relative to GWO at 0.0324.

4.4.2. Case 2: Post-Fault Reconfiguration in a Shipboard MVDC Power System Under Generator G1 and G3 Outages

In the initial state, all tie switches between zones remain open, with each generator simultaneously feeding power to both busbars. After a fault event, the Dep-QMIX-Mask algorithm executes a series of operational steps to restore power supply and reconfigure the power topology. All loads within the area previously supplied by generators G1 and G3 experienced a power outage. Lines x3, x7, x14, and x16 were closed. Generator G2 independently assumed power supply responsibilities for the port side, splitting the system into two mutually independent power zones. With Generators G1 and G3 out of operation, the remaining system capacity was found to be insufficient. To prioritize recovery of all primary loads, the secondary load L7 and tertiary loads L8, L14, L18, and L12 were disconnected.
Figure 16 illustrates the system topology changes in Case 2, showing the shipboard system topology after the failure and the reconfigured system topology. The distribution of node voltage per unit values after reconfiguration is shown in Figure 17. As shown in the figure, the node voltage per unit values of normally powered nodes are maintained within the reasonable range of 0.95 to 1.05 p.u. Figure 18 shows the SLR at each time step. The results indicate that the proposed Dep-QMIX-Mask algorithm can prioritize the power supply of high-priority loads while restoring the de-energized loads.
In this case, the zonal PURs and their variance after reconfiguration by each algorithm are shown in Figure 19. Compared with QMIX, WOA, and GWO, Dep-QMIX-Mask not only achieves the highest PUR after reconfiguration but also exhibits the smallest utilization variance, demonstrating a clear advantage in load balancing. Specifically, the PUR variance of Dep-QMIX-Mask is 0.0025, compared with 0.0090 for QMIX, 0.0169 for WOA, and 0.0042 for GWO. Table 6 shows that Dep-QMIX-Mask achieves the highest served load rate after reconfiguration. It produces feasible solutions in all 20 experiments (NFS/20 = 20/20) with an average computation time of only 0.12 s; relative to WOA (0.84 s) and GWO (0.52 s), this corresponds to 7.00× and 4.33× speedups, respectively. In addition, the SLR of Dep-QMIX-Mask is 0.67, higher than 0.63 for QMIX, 0.60 for WOA, and 0.56 for GWO, corresponding to SLR improvements of 6.3%, 11.7%, and 19.6%, respectively. Meanwhile, its PUR variance is 72.2% lower than that of QMIX, 85.2% lower than that of WOA, and 40.5% lower than that of GWO, giving the best performance in both load restoration and load balancing.
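The reported SLR values can be cross-checked against the ORR bit strings and the load powers in Table 3. The sketch below (helper names are illustrative, not from the paper's code) recovers the Case 2 SLR of 0.67 from the Dep-QMIX-Mask ORR in Table 6, where bit i is 1 when load L(i+1) is served.

```python
# Cross-check the Case 2 served load rate (SLR) from the ORR bit
# string and the load powers in Table 3 (kW).

LOAD_KW = [72, 120, 200, 150, 160, 100, 80, 325, 185,
           44, 225, 205, 110, 72, 87, 100, 205, 200]

def served_load_rate(orr):
    """SLR = served load power / total load power for an 18-bit ORR."""
    served = sum(p for bit, p in zip(orr, LOAD_KW) if bit == "1")
    return served / sum(LOAD_KW)

orr_case2 = "111111001110101110"  # Dep-QMIX-Mask, Case 2 (Table 6)
print(round(served_load_rate(orr_case2), 2))  # → 0.67
```

The zero bits correspond exactly to the shed loads L7, L8, L12, L14, and L18 described in the text.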

4.4.3. Case 3: Post-Fault Reconfiguration in a Shipboard MVDC Power System Under G1 and G4 Outages

In the initial state, lines x5 and x11 are open, so each generator supplies only one busbar. Following the fault, the Dep-QMIX-Mask algorithm restored supply and reconfigured the network through a series of decisions. Line x5 was closed so that generator G2 supplied loads L1, L2, and L3, while secondary load L15, tertiary load L14, and primary loads L4, L5, and L6 were supplied from generator G3, dividing the system into two independent supply zones. Figure 20 illustrates the topology changes in Case 3, showing the shipboard system topology after the fault and after reconfiguration. The distribution of per-unit node voltages for this case is presented in Figure 21; all values remain within the safe operating limits of 0.95–1.05 p.u. Figure 22 displays the SLR at each time step. The results indicate that the proposed Dep-QMIX-Mask algorithm prioritizes power supply to high-priority loads while restoring de-energized loads. As shown in Figure 23, Dep-QMIX-Mask achieves the highest PUR with the smallest variance after reconfiguration, a clear advantage over QMIX, WOA, and GWO in load balancing. Table 7 indicates that Dep-QMIX-Mask achieves the highest served load rate while maintaining computational efficiency: it attains an SLR of 0.43, outperforming QMIX at 0.41 and WOA/GWO at 0.37, and maintains full feasibility with NFS/20 equal to 20/20. It completes reconfiguration in 0.12 s on average, providing 7.83× and 5.33× speedups over WOA and GWO, respectively. In addition, it yields the lowest PUR variance of 0.000025, markedly smaller than QMIX at 0.0030 and WOA/GWO at 0.0009.
Following the detailed results for the three representative fault cases in Table 5, Table 6 and Table 7, Table 8 summarizes the overall results, reporting mean ± standard deviation for SLR, NFS per 20 runs, average time in seconds, and PUR variance. The results show that Dep-QMIX-Mask outperforms QMIX, WOA, and GWO in SLR, feasibility, and load balancing: it achieves 20 out of 20 feasible sequences, the lowest PUR variance, and a stable average online decision time of 0.12 s. These results indicate that the proposed method delivers strong overall effectiveness for fault reconfiguration and restoration.
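The ± values in Table 8 are consistent with the sample standard deviation (n − 1 denominator) taken across the three cases; the short sketch below reproduces the SLR row from the per-case values in Tables 5–7.

```python
# Reproduce the Table 8 SLR summary (mean ± std across Cases 1-3).
# statistics.stdev uses the sample (n-1) standard deviation, which
# matches the tabulated values.
import statistics

slr = {"Dep-QMIX-Mask": [0.88, 0.67, 0.43],
       "QMIX": [0.88, 0.63, 0.41],
       "WOA": [0.88, 0.60, 0.37],
       "GWO": [0.83, 0.56, 0.37]}

for name, vals in slr.items():
    mean = round(statistics.mean(vals), 2)
    std = round(statistics.stdev(vals), 2)
    print(f"{name}: {mean} ± {std}")
```

Running this yields 0.66 ± 0.23, 0.64 ± 0.24, 0.62 ± 0.26, and 0.59 ± 0.23, matching the SLR row of Table 8.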

4.5. Application of the Dep-QMIX-Mask Method to a New Shipboard MVDC Topology

The previous sections validated the effectiveness of the proposed method in a typical ring-type zonal shipboard MVDC power system topology. However, considering that actual shipboard power systems may differ significantly in topology due to variations in ship type, mission requirements, and design philosophy, it is crucial to evaluate the generalizability of the proposed method across different architectures. Therefore, in this section, the proposed Dep-QMIX-Mask method is applied to a new shipboard MVDC power system topology, which differs in both structure and complexity, in order to systematically verify its robustness, adaptability, and general applicability.
The model of the shipboard MVDC four-zone ring-type power system is shown in Figure 24. Four diesel generators feed the DC bus through rectifiers, supplying 12 loads. The loads are classified into three priority levels, and bow and stern contact lines can connect the port and starboard sides to enable flexible power reconfiguration and load restoration after faults. Table 9 lists the output power of the generators, and Table 10 lists the power and priority levels of the loads [48].
To apply the proposed method to the new topology, the Dec-POMDP framework defined in Section 2.3 is adopted, and the controllable devices are mapped to four function-specific agents. The Primary Load Agent selects the supply path, either port or starboard, for each primary load. The Secondary Load Agent and the Tertiary Load Agent control the corresponding load switches. The Contact Line Agent manages the switching status of the port–starboard contact lines. Each agent uses an 8-action discrete space, where at each decision step the agent selects one of four controlled components and assigns one of two states, resulting in 4 × 2 = 8 actions. This action encoding and the associated feasibility constraints are consistent with Agent1–Agent4 in Section 2.3.
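The 4 × 2 = 8 action encoding described above can be sketched as follows; the function and variable names are illustrative, not taken from the paper's implementation.

```python
# Minimal sketch of the 8-action discrete space: at each decision
# step an agent selects one of its four controlled components and
# assigns it one of two states, so an action index a in {0..7}
# decodes as a (component, state) pair.

def decode_action(a):
    """Map an action index to (component, state)."""
    assert 0 <= a < 8
    component = a // 2   # which of the agent's 4 devices
    state = a % 2        # 0 = open/shed, 1 = close/supply
    return component, state

# The 8 indices enumerate every (component, state) pair exactly once.
pairs = [decode_action(a) for a in range(8)]
assert len(set(pairs)) == 8
```

Under this encoding, the per-agent action space stays fixed at 8 regardless of the topology, which is what allows the same agent definitions to transfer to the new configuration.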
The proposed mapping keeps the observation modalities (bus voltages, line statuses, and load powers) and the global reward function in Equation (37) unchanged. The only adaptation is the instantiation of the observation vector to match the new topology, which increases the state and observation dimensions. All Dep-QMIX-Mask mechanisms and training hyperparameters remain identical across experiments. Figure 25 presents the training return curve, where the return increases and then stabilizes, indicating convergence to a consistent policy on the new configuration.
Two fault types are considered on the new shipboard topology, and the reconfiguration results are summarized in Table 11. The results show that the proposed strategy restores de-energized loads to the greatest extent while maintaining load balancing, thereby ensuring satisfactory operational quality. Figure 26 compares the system topology immediately after the fault and after reconfiguration for both fault scenarios, and Figure 27 shows that all post-reconfiguration node voltages remain within the allowable range without violations.
Overall, these results further support the scalability and robustness of the proposed method. The multi-agent decomposition avoids a monolithic joint action space by keeping each agent’s action set small, so the online decision cost grows approximately linearly with the number of agents and decision steps, rather than combinatorially with the number of controllable switches. In addition, the combination of dependency correction and action masking enforces switching dependencies and constructs a state-dependent feasible action set, filtering infeasible or redundant operations and reducing wasted exploration. This is particularly beneficial under unseen fault scenarios, where post-fault constraints and feasible switching sequences may differ from those observed during training. When moving from the baseline topology to the new topology with the same algorithmic design and training hyperparameters, Dep-QMIX-Mask consistently produces feasible reconfiguration strategies with strong load restoration and satisfactory load balancing, indicating that the learned coordination generalizes beyond a single fixed architecture.
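The masking step described above can be sketched as a feasibility-filtered greedy selection: infeasible actions are excluded before the argmax, so the chosen action always lies in the state-dependent feasible set. Names and values below are illustrative only.

```python
# Illustrative sketch of action masking at execution time: actions
# flagged infeasible by the mask are skipped, and the greedy choice
# is taken over the remaining feasible Q-values.
import math

def masked_argmax(q_values, mask):
    """mask[i] is True when action i is feasible in the current state."""
    best, best_q = None, -math.inf
    for i, (q, ok) in enumerate(zip(q_values, mask)):
        if ok and q > best_q:
            best, best_q = i, q
    return best

q = [0.3, 1.2, -0.5, 0.9]
feasible = [True, False, True, True]  # action 1 violates a constraint
assert masked_argmax(q, feasible) == 3  # best feasible action wins
```

Because infeasible actions are never proposed, no post hoc repair of the switching sequence is needed, which is one reason the online decision time stays flat as the topology grows.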

5. Conclusions and Future Work

This paper proposes a MADRL method based on Dep-QMIX-Mask for fault reconfiguration in shipboard MVDC power systems. By accounting for load priorities, the approach jointly maximizes total load restoration and load balancing after a fault. First, a Dec-POMDP model of fault reconfiguration in shipboard MVDC power systems is constructed. Within this framework, a Dep-QMIX algorithm incorporating a dependency correction mechanism is introduced to capture action dependencies among agents, enhancing collaborative decision-making in highly coupled settings. Second, to mitigate the impact of invalid switching actions on learning and execution, an action masking mechanism is introduced, restricting action selection to feasible sets and thereby improving training efficiency and policy execution stability.
Experimental results demonstrate that the proposed method successfully achieves fault isolation and power restoration for de-energized loads across multiple fault scenarios. It simultaneously balances load restoration and load distribution while satisfying system operational constraints, thereby validating its effectiveness for fault reconfiguration in shipboard MVDC power systems. Furthermore, the method consistently yields feasible reconfiguration schemes under diverse fault conditions, indicating its applicability and practical value. However, the proposed method still has several limitations. This work only addresses static system structures and fault scenarios without considering the impact of time-varying factors such as load fluctuations and external disturbances on the fault reconfiguration process during actual operation. Future research can focus on more complex shipboard MVDC power systems, incorporating time-varying operating condition modeling and corresponding decision mechanisms to enhance their adaptability and restoration robustness in dynamic environments.

Author Contributions

Conceptualization, G.Y., X.L., A.S. and M.A.-A.; methodology, G.Y. and X.L.; software, X.L.; validation, G.Y. and X.L.; formal analysis, G.Y. and X.L.; investigation, G.Y. and X.L.; data curation, A.S. and M.A.-A.; writing—original draft preparation, G.Y. and X.L.; writing—review and editing, G.Y., X.L., A.S., M.A.-A. and M.B.; visualization, G.Y. and X.L.; supervision, G.Y., A.S., M.A.-A. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed at the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Nomenclature and abbreviations.
N_load: Number of served loads in the power system
N_zone: Number of zones in the power system
N_region: Number of regions in the power system
n_1, n_2, n_3: Numbers of supplied primary, secondary, and tertiary loads at time t
p_i^load: Power of energized load i
p_i^g: Maximum available generator power of zone i
q_i^g: Reactive power demand at node i
p_k^load: Load power at node k
p_k^{load,max}: Maximum allowable load power at node k
p_i^{g,min}, p_i^{g,max}: Min/max allowable active power generation at node i
q_i^{g,min}, q_i^{g,max}: Min/max allowable reactive power generation at node i
v_i, v_j, v_k, v_l: Voltage magnitude at nodes i, j, k, l
δ_i, δ_j: Voltage phase angle at nodes i, j
v_k^min, v_k^max: Min/max allowable voltage at node k
i_{in,k}: Incoming current at node k
i_{out,k}: Outgoing current at node k
i_k^load: Load current at node k
i_{kl}: Current between nodes k and l
i_{kl}^max: Maximum allowable current on the branch between nodes k and l
R_{kl}: Resistance between nodes k and l
G_{ij}: Real part of node admittance matrix element
B_{ij}: Imaginary part of node admittance matrix element
p_{k,t}^{load,1}: Served power of primary loads at time t
p_{h,t}^{load,2}: Served power of secondary loads at time t
p_{h,t}^{load,3}: Served power of tertiary loads at time t
p_{f,t}^g: Active power output of generator f at time t
u_{k,t}: Generator utilization rate in zone k at time t
ū_t: Average utilization rate at time t
r_t^load: Instantaneous load-supply reward at time step t
r_t^node: Node-voltage violation penalty at time t
r_t^balance: Load-balancing reward at time t
λ_v: Voltage violation penalty coefficient
λ_b: Load-balancing penalty coefficient
ψ_i: Parameters of Q_i^ind
ω_i: Parameters of Δ_i^dep
ξ: Soft update coefficient

Appendix B

Table A2. Glossary.
MVDC: Medium-voltage direct current
Dec-POMDP: Decentralized partially observable Markov decision process
Agent: Decision-making entity controlling a subset of loads or switches
Joint action: Combined action of all agents at one decision step
QMIX: Value-decomposition multi-agent RL with a monotonic mixing network
Dep-QMIX: QMIX with dependency correction for inter-agent action dependencies
Dep-QMIX-Mask: Dep-QMIX with action masking to enforce feasibility online
Dependency correction: Correction that conditions value estimation on preceding agents' actions
Action masking: State-dependent removal of infeasible or redundant actions
SLR: Served load rate, the ratio of served load power to total load power
PUR: Power utilization rate of each zone after reconfiguration
PUR variance: Variance of PUR across zones; smaller values indicate better load balancing
ORR: Optimal reconfiguration result, a binary encoding of the supply status of all loads
NFS: Number of feasible solutions across independent runs
WOA: Whale optimization algorithm
GWO: Grey wolf optimizer
p.u.: Per-unit value used to normalize electrical quantities
MN: Mixing network that combines per-agent value functions into a global value function
Hypernetwork: Network that generates nonnegative parameters for the mixing network to satisfy monotonicity
Dependency mode: Training mode that conditions value estimation on other agents' actions to capture inter-agent dependence
Independency mode: Execution mode in which each agent selects actions using only local observations and its learned policy

References

  1. Xu, L.; Guerrero, J.; Lashab, A.; Wei, B.; Bazmohammadi, N.; Vasquez, J.; Abusorrah, A. A Review of DC Shipboard Microgrids—Part I: Power Architectures, Energy Storage, and Power Converters. IEEE Trans. Power Electron. 2022, 37, 5155–5172. [Google Scholar] [CrossRef]
  2. Badakhshan, S.; Rahman, J.; Zhang, J. Black Start of Coastline Power Networks from Grid-Forming Ship-to-Grid Services. IEEE Trans. Smart Grid 2024, 15, 1670–1679. [Google Scholar] [CrossRef]
  3. Xu, Q.; Yang, B.; Han, Q.; Yuan, Y.; Chen, C.; Guan, X. Optimal Power Management for Failure Mode of MVDC Microgrids in All-Electric Ships. IEEE Trans. Power Syst. 2018, 34, 1054–1067. [Google Scholar] [CrossRef]
  4. Zheng, W.; Huang, W.; Hill, D.J. A Deep Learning-Based General Robust Method for Network Reconfiguration in Three-Phase Unbalanced Active Distribution Networks. Int. J. Electr. Power Energy Syst. 2020, 120, 105982. [Google Scholar] [CrossRef]
  5. Mishra, D.K.; Ghadi, M.J.; Azizivahed, A.; Li, L.; Zhang, J. A Review on Resilience Studies in Active Distribution Systems. Renew. Sustain. Energy Rev. 2021, 135, 110201. [Google Scholar] [CrossRef]
  6. Mahdavi, M.; Javadi, M.S.; Wang, F.; Catalao, J.P.S. An Efficient Model for Accurate Evaluation of Consumption Pattern in Distribution System Reconfiguration. IEEE Trans. Ind. Applicat. 2022, 58, 3102–3111. [Google Scholar] [CrossRef]
  7. Ye, Z.; Chen, C.; Wu, K.; Gu, W. Boost Distribution System Restoration with Emergency Communication Vehicles Considering Cyber-Physical Interdependence. IEEE Trans. Smart Grid 2022, 14, 1262–1275. [Google Scholar] [CrossRef]
  8. Shi, Q.; Li, F.; Dong, J.; Olama, M.; Wang, X.; Winstead, C.; Kuruganti, T. Co-Optimization of Repairs and Dynamic Network Reconfiguration for Improved Distribution System Resilience. Appl. Energy 2022, 318, 119245. [Google Scholar] [CrossRef]
  9. Lei, S.; Chen, C.; Li, Y.; Hou, Y. Resilient Disaster Recovery Logistics of Distribution Systems: Co-Optimize Service Restoration with Repair Crew and Mobile Power Source Dispatch. IEEE Trans. Smart Grid 2019, 10, 6187–6202. [Google Scholar] [CrossRef]
  10. Gazijahani, F.S.; Salehi, J.; Shafie-khah, M.; Catalao, J.P.S. Spatiotemporal Splitting of Distribution Networks into Self-Healing Resilient Microgrids Using an Adjustable Interval Optimization. IEEE Trans. Ind. Inf. 2021, 17, 5218–5229. [Google Scholar] [CrossRef]
  11. Zhan, J.; Liu, W.; Chung, C.Y.; Yang, J. Switch Opening and Exchange Method for Stochastic Distribution Network Reconfiguration. IEEE Trans. Smart Grid 2020, 11, 2995–3007. [Google Scholar] [CrossRef]
  12. Civanlar, S.; Grainger, J.J.; Yin, H.; Lee, S.S.H. Distribution Feeder Reconfiguration for Loss Reduction. IEEE Trans. Power Del. 1988, 3, 1217–1223. [Google Scholar] [CrossRef]
  13. Arasteh, H.; Sepasian, M.S.; Vahidinasab, V. An Aggregated Model for Coordinated Planning and Reconfiguration of Electric Distribution Networks. Energy 2016, 94, 786–798. [Google Scholar] [CrossRef]
  14. Azad-Farsani, E.; Sardou, I.G.; Abedini, S. Distribution Network Reconfiguration Based on LMP at DG Connected Busses Using Game Theory and Self-Adaptive FWA. Energy 2021, 215, 119146. [Google Scholar] [CrossRef]
  15. Jafari, A.; Ganjeh Ganjehlou, H.; Baghal Darbandi, F.; Mohammadi-Ivatloo, B.; Abapour, M. Dynamic and Multi-Objective Reconfiguration of Distribution Network Using a Novel Hybrid Algorithm with Parallel Processing Capability. Appl. Soft Comput. 2020, 90, 106146. [Google Scholar] [CrossRef]
  16. Wang, Z.; Hu, Z.; Yang, X. Multi-Agent and Ant Colony Optimization for Ship Integrated Power System Network Reconfiguration. J. Syst. Eng. Electron. 2022, 33, 489–496. [Google Scholar] [CrossRef]
  17. Pathan, M.I.; Al-Muhaini, M.; Djokic, S.Z. Optimal Reconfiguration and Supply Restoration of Distribution Networks with Hybrid Microgrids. Electr. Power Syst. Res. 2020, 187, 106458. [Google Scholar] [CrossRef]
  18. Shariatzadeh, F.; Vellaithurai, C.B.; Biswas, S.S.; Zamora, R.; Srivastava, A.K. Real-Time Implementation of Intelligent Reconfiguration Algorithm for Microgrid. IEEE Trans. Sustain. Energy 2014, 5, 598–607. [Google Scholar] [CrossRef]
  19. Tu, Z.; Zhang, W.; Liu, W. Deep Reinforcement Learning Control for Pulsed Power Load Online Deployment in DC Shipboard Integrated Power System. IEEE Trans. Power Syst. 2022, 38, 3557–3567. [Google Scholar] [CrossRef]
  20. Zhu, J.; Li, D.; Chen, Y.; Chen, J.; Luo, Y. Parallel Hybrid Deep Reinforcement Learning for Real-Time Energy Management of Microgrid. J. Mod. Power Syst. Clean Energy 2024, 13, 991–1002. [Google Scholar] [CrossRef]
  21. Kaewdornhan, N.; Chatthaworn, R. Predictive Energy Management for Microgrid Using Multi-Agent Deep Deterministic Policy Gradient with Random Sampling. IEEE Access 2024, 12, 95071–95090. [Google Scholar] [CrossRef]
  22. Cao, D.; Zhao, J.; Hu, W.; Yu, N.; Ding, F.; Huang, Q.; Chen, Z. Deep Reinforcement Learning Enabled Physical-Model-Free Two-Timescale Voltage Control Method for Active Distribution Systems. IEEE Trans. Smart Grid 2022, 13, 149–165. [Google Scholar] [CrossRef]
  23. Liu, H.; Wu, W. Online Multi-Agent Reinforcement Learning for Decentralized Inverter-Based Volt-VAR Control. IEEE Trans. Smart Grid 2021, 12, 2980–2990. [Google Scholar] [CrossRef]
  24. Zhang, H.; Sun, X.; Lee, M.H.; Moon, J. Deep Reinforcement Learning-Based Active Network Management and Emergency Load-Shedding Control for Power Systems. IEEE Trans. Smart Grid 2024, 15, 1423–1437. [Google Scholar] [CrossRef]
  25. Yi, Z.; Wang, X.; Yang, C.; Yang, C.; Niu, M.; Yin, W. Real-Time Sequential Security-Constrained Optimal Power Flow: A Hybrid Knowledge-Data-Driven Reinforcement Learning Approach. IEEE Trans. Power Syst. 2024, 39, 1664–1680. [Google Scholar] [CrossRef]
  26. Huang, Y.; Li, G.; Chen, C.; Bian, Y.; Qian, T.; Bie, Z. Resilient Distribution Networks by Microgrid Formation Using Deep Reinforcement Learning. IEEE Trans. Smart Grid 2022, 13, 4918–4930. [Google Scholar] [CrossRef]
  27. Zhang, T. A Fault Reconfiguration Strategy Based on Logical Structure and Improved Reinforcement Learning for Ship DC Regional Grid. J. Frankl. Inst. 2024, 361, 107111. [Google Scholar] [CrossRef]
  28. Oh, S.H.; Yoon, Y.T.; Kim, S.W. Online Reconfiguration Scheme of Self-Sufficient Distribution Network Based on a Reinforcement Learning Approach. Appl. Energy 2020, 280, 115900. [Google Scholar] [CrossRef]
  29. Jo, S.; Oh, J.-Y.; Yoon, Y.T.; Jin, Y.G. Self-Healing Radial Distribution Network Reconfiguration Based on Deep Reinforcement Learning. Results Eng. 2024, 22, 102026. [Google Scholar] [CrossRef]
  30. Yao, Y.; Zhang, X.; Wang, J.; Ding, F. Multi-Agent Reinforcement Learning for Distribution System Critical Load Restoration. In Proceedings of the 2023 IEEE Power & Energy Society General Meeting (PESGM); IEEE: Orlando, FL, USA, 2023; pp. 1–5. [Google Scholar]
  31. Si, R.; Chen, S.; Zhang, J.; Xu, J.; Zhang, L. A Multi-Agent Reinforcement Learning Method for Distribution System Restoration Considering Dynamic Network Reconfiguration. Appl. Energy 2024, 372, 123625. [Google Scholar] [CrossRef]
  32. IEEE Std 1709-2018; IEEE Recommended Practice for 1 kV to 35 kV Medium-Voltage DC Power Systems on Ships. IEEE Standards Association: Piscataway, NJ, USA, 2018.
  33. Das, S.; Bose, S.; Pal, S.; Schulz, N.N.; Scoglio, C.M.; Natarajan, B. Dynamic Reconfiguration of Shipboard Power Systems Using Reinforcement Learning. IEEE Trans. Power Syst. 2013, 28, 669–676. [Google Scholar] [CrossRef]
  34. Wang, T.; Liu, W.; Hao, Z. A Decentralized Model-Based Fault Detection and Isolation Scheme for MVDC Shipboard Power Systems. IEEE Trans. Transport. Electrific. 2024, 10, 7804–7815. [Google Scholar] [CrossRef]
  35. Ghione, G.; Randazzo, V.; Pasero, E.; Badami, M. Optimal Cogeneration Scheduling: A Comparison of Genetic and POMDP-Based Deep Reinforcement Learning Approaches. IEEE Access 2025, 13, 128562–128581. [Google Scholar] [CrossRef]
  36. Xiang, J.; Li, Q.; Dong, X.; Ren, Z. Continuous Control with Deep Reinforcement Learning for Mobile Robot Navigation. In Proceedings of the 2019 Chinese Automation Congress (CAC); IEEE: Hangzhou, China, 2019; pp. 1501–1506. [Google Scholar]
  37. Salehi, S.; Iturria-Rivera, P.E.; Elsayed, M.; Bavand, M.; Gaigalas, R.; Ozcan, Y.; Erol-Kantarci, M. Prioritized Value-Decomposition Network for Explainable AI-Enabled Network Slicing. arXiv 2025, arXiv:2501.15734. [Google Scholar]
  38. Li, R.; Xu, L. Active RIS-Assisted Uplink NOMA with MADDPG for Remote State Estimation in Wireless Sensor Networks. Sensors 2025, 25, 4878. [Google Scholar] [CrossRef]
  39. Pan, Y.; Wang, X.; Xu, Z.; Cheng, N.; Xu, W.; Zhang, J. GNN-Empowered Effective Partial Observation MARL Method for AoI Management in Multi-UAV Network. IEEE Internet Things J. 2024, 11, 34541–34553. [Google Scholar] [CrossRef]
  40. Rashid, T.; Samvelyan, M. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
  41. Rashid, T.; Farquhar, G.; Peng, B.; Whiteson, S. Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 10199–10210. [Google Scholar]
  42. Gao, F.; Cai, Y.; Yao, H.; Li, S.; Gao, Q.; Yin, L. Factorizing Value Function with Hierarchical Residual Q-Network in Multi-Agent Reinforcement Learning. Neurocomputing 2025, 655, 131340. [Google Scholar] [CrossRef]
  43. Guo, W.; Liu, G.; Zhou, Z.; Wang, L.; Wang, J. Enhancing the Robustness of QMIX against State-Adversarial Attacks. Neurocomputing 2024, 572, 127191. [Google Scholar] [CrossRef]
  44. Do, Q.T.; Hua, T.D.; Tran, A.-T.; Won, D.; Woraphonbenjakul, G.; Noh, W.; Cho, S. Multi-UAV Aided Energy-Aware Transmissions in mmWave Communication Network: Action-Branching QMIX Network. J. Netw. Comput. Appl. 2024, 230, 103948. [Google Scholar] [CrossRef]
  45. Kim, J.I.; Kim, S.B. D-Qmix: Multi-Step Sequential Forward Dynamics Modeling with Global State and Self-Attention for Sample-Efficient Multi-Agent Reinforcement Learning. Inf. Sci. 2025, 729, 122867. [Google Scholar] [CrossRef]
  46. Jiang, K.-Y.; Chiu, W.-Y.; Tsai, Y.-P. Profit Maximization for Electric Vehicle Charging Stations Using Multiagent Reinforcement Learning. Sustain. Energy Grids Netw. 2025, 44, 102009. [Google Scholar] [CrossRef]
  47. Yi, S.; Luo, J. Deep Reinforcement Learning Driven by Heuristics with Petri Nets for Enhancing Real-Time Scheduling in Robotic Job Shops. Rob. Comput. Integr. Manuf. 2026, 97, 103097. [Google Scholar] [CrossRef]
  48. Babaei, M.; Jafari-Marandi, R.; Abdelwahed, S.; Kluss, J. A Novel Approach for Real-Time Implementation of MVDC Shipboard Power System Reconfiguration. Int. J. Electr. Power Energy Syst. 2018, 100, 240–252. [Google Scholar] [CrossRef]
Figure 1. Diagram of the shipboard MVDC power system.
Figure 2. Directed-graph representation of the shipboard MVDC power system.
Figure 3. Dec-POMDP for shipboard MVDC Power System Fault Reconfiguration.
Figure 4. QMIX Algorithm Schematic Diagram [40].
Figure 5. Dep-QMIX Algorithm Schematic Diagram.
Figure 6. Action Masking Process Diagram.
Figure 7. Training Process of the Dep-QMIX-Mask Algorithm.
Figure 8. Mean cumulative reward versus training episodes for different discount factors.
Figure 9. Training Curves of Different Methods.
Figure 10. Comparison of CR Curves between Independency Mode and Dependency Mode.
Figure 11. Electrical Topology of the Reconfiguration Scheme for Case 1: (a) Results After Fault; (b) Results After Reconfiguration.
Figure 12. Node Voltage Per Unit (p.u.) of the Reconfiguration Scheme for Case 1.
Figure 13. Sequential Served Load Rate of Case 1.
Figure 14. Per-step inference time during online decision-making.
Figure 15. Comparison of PUR and Variance of Different Algorithms for Case 1.
Figure 16. Electrical Topology of the Reconfiguration Scheme for Case 2: (a) Results After Fault; (b) Results After Reconfiguration.
Figure 17. Node Voltage Per Unit (p.u.) of the Reconfiguration Scheme for Case 2.
Figure 18. Sequential Served Load Rate of Case 2.
Figure 19. Comparison of PUR and Variance of Different Algorithms for Case 2.
Figure 20. Electrical Topology of the Reconfiguration Scheme for Case 3: (a) Results After Fault; (b) Results After Reconfiguration.
Figure 21. Node Voltage Per Unit (p.u.) of the Reconfiguration Scheme for Case 3.
Figure 22. Sequential Served Load Rate of Case 3.
Figure 23. Comparison of PUR and Variance of Different Algorithms for Case 3.
Figure 24. Shipboard MVDC Power System Model Structure.
Figure 25. Dep-QMIX-Mask Training Curve Under the New Topology.
Figure 26. Electrical Topology of the Reconfiguration Scheme: (a) Results for Case 1; (b) Results for Case 2.
Figure 27. Node Voltage Per Unit (p.u.) of Reconfiguration Schemes: (a) Case 1; (b) Case 2.
Table 1. Summary of the Dec-POMDP specification.
Agent | Action a_t^i | Observation o_t^i
Agent1 | a_t^1 = (k, m_t), k ∈ K, m_t ∈ {0, 1} | o_t^1 = [{z_{k,t}^p, z_{k,t}^s, p_{k,t}^{load,1}, v_{k,t}^1}_{k ∈ K}]
Agent2 | a_t^2 = (h, u_t^2), h ∈ H, u_t^2 ∈ {0, 1} | o_t^2 = [{z_{h,t}^2, p_{h,t}^{load,2}, v_{h,t}^2}_{h ∈ H}]
Agent3 | a_t^3 = (h, u_t^3), h ∈ H, u_t^3 ∈ {0, 1} | o_t^3 = [{z_{h,t}^3, p_{h,t}^{load,3}, v_{h,t}^3}_{h ∈ H}]
Agent4 | a_t^4 = (h, u_t^4), h ∈ H, u_t^4 ∈ {0, 1} | o_t^4 = [{z_{h,t}^4}_{h ∈ H}]
Agent5 | a_t^5 = (f, b_t), f ∈ F, b_t ∈ {0, 1, 2} | o_t^5 = [{z_{f,t}^p, z_{f,t}^s, p_{f,t}^g, v_{f,t}^5}_{f ∈ F}]
All agents share the global reward r_t.
Note: K indexes the 6 primary loads; H indexes the 6 controllable loads/switches within each category (secondary loads, tertiary loads, and tie switches); F indexes the 4 generators.
Table 2. Summary of Shipboard Generator Power Parameters.
Generator | Power (kW)
G1 | 1200
G2 | 600
G3 | 600
G4 | 1200
Table 3. Summary of Shipboard Load Parameters.
Load No. | kW | Load Level | Load No. | kW | Load Level
1 | 72 | 1 | 10 | 44 | 3
2 | 120 | 1 | 11 | 225 | 2
3 | 200 | 1 | 12 | 205 | 3
4 | 150 | 1 | 13 | 110 | 2
5 | 160 | 1 | 14 | 72 | 3
6 | 100 | 1 | 15 | 87 | 2
7 | 80 | 2 | 16 | 100 | 3
8 | 325 | 3 | 17 | 205 | 2
9 | 185 | 2 | 18 | 200 | 3
Table 4. Summary of MADRL Hyperparameter Settings.
Parameter | Value
Discount Factor (γ) | 0.99
ε Decay | 1 → 0.05
Soft Update Coefficient | 0.001
Initial Learning Rate | 0.008
Batch Size | 32
Max Steps per Episode | 6
Hidden Layer Size | 64
Min Learning Rate (Cosine Annealing) | 0.0005
Total Training Epochs | 2000
Mixing Network Layer Dimension | 64
Table 5. Case 1 Fault Reconfiguration Performance Comparison.

| Metric | Dep-QMIX-Mask | QMIX | WOA | GWO |
|---|---|---|---|---|
| ORR | 111111101111111111 | 111111101111111111 | 111111101111111111 | 111111001011111111 |
| SLR | 0.88 | 0.88 | 0.88 | 0.83 |
| NFS/20 | 20/20 | 20/20 | 18/20 | 16/20 |
| Avg. time (s) | 0.12 | 0.12 | 0.86 | 0.63 |
| PUR variance | 0.0048 | 0.0291 | 0.0291 | 0.0324 |
| Speedup (× vs. WOA/GWO) | 7.17/5.25 | 7.17/5.25 | 1.00/0.73 | 1.37/1.00 |
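The speedup row in Tables 5–7 is the ratio of a baseline's average solve time to each method's own time, reported against both WOA and GWO. For Case 1, WOA's 0.86 s against Dep-QMIX-Mask's 0.12 s gives 0.86/0.12 ≈ 7.17. A quick check reproducing the full Case 1 row from the average times:

```python
# Average solve times (s) from Table 5, Case 1.
times = {"Dep-QMIX-Mask": 0.12, "QMIX": 0.12, "WOA": 0.86, "GWO": 0.63}

# speedup of a method = baseline time / method time, reported vs. WOA and GWO.
speedup = {m: (round(times["WOA"] / t, 2), round(times["GWO"] / t, 2))
           for m, t in times.items()}
```

The computed pairs match Table 5: (7.17, 5.25) for Dep-QMIX-Mask and QMIX, (1.00, 0.73) for WOA, and (1.37, 1.00) for GWO.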
Table 6. Case 2 Fault Reconfiguration Performance Comparison.

| Metric | Dep-QMIX-Mask | QMIX | WOA | GWO |
|---|---|---|---|---|
| ORR | 111111001110101110 | 111111001110101010 | 111111000110101110 | 111111000110100110 |
| SLR | 0.67 | 0.63 | 0.60 | 0.56 |
| NFS/20 | 20/20 | 20/20 | 17/20 | 14/20 |
| Avg. time (s) | 0.12 | 0.12 | 0.84 | 0.52 |
| PUR variance | 0.0025 | 0.0090 | 0.0169 | 0.0042 |
| Speedup (× vs. WOA/GWO) | 7.00/4.33 | 7.00/4.33 | 1.00/0.62 | 1.62/1.00 |
Table 7. Case 3 Fault Reconfiguration Performance Comparison.

| Metric | Dep-QMIX-Mask | QMIX | WOA | GWO |
|---|---|---|---|---|
| ORR | 111111001000011000 | 111111001000000100 | 111111100000000100 | 111111100000000100 |
| SLR | 0.43 | 0.41 | 0.37 | 0.37 |
| NFS/20 | 20/20 | 20/20 | 16/20 | 15/20 |
| Avg. time (s) | 0.12 | 0.12 | 0.94 | 0.64 |
| PUR variance | 0.000025 | 0.0030 | 0.0009 | 0.0009 |
| Speedup (× vs. WOA/GWO) | 7.83/5.33 | 7.83/5.33 | 1.00/0.68 | 1.47/1.00 |
Table 8. Summary across the three representative cases (Cases 1–3): mean ± std across cases.

| Metric | Dep-QMIX-Mask | QMIX | WOA | GWO |
|---|---|---|---|---|
| SLR | 0.66 ± 0.23 | 0.64 ± 0.24 | 0.62 ± 0.26 | 0.59 ± 0.23 |
| NFS/20 | 20.0 ± 0.0 | 20.0 ± 0.0 | 17.0 ± 1.0 | 15.0 ± 1.0 |
| Avg. time (s) | 0.12 ± 0.00 | 0.12 ± 0.00 | 0.88 ± 0.05 | 0.60 ± 0.07 |
| PUR variance (×10⁻³) | 2.44 ± 2.39 | 13.70 ± 13.67 | 15.63 ± 14.14 | 12.50 ± 17.31 |
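The SLR row of Table 8 can be reproduced directly from the per-case SLR values in Tables 5–7; the ± figure matches the sample standard deviation (n − 1 denominator) across the three cases:

```python
import statistics

# Per-case SLR values for Cases 1-3, taken from Tables 5-7.
slr = {
    "Dep-QMIX-Mask": [0.88, 0.67, 0.43],
    "QMIX": [0.88, 0.63, 0.41],
    "WOA": [0.88, 0.60, 0.37],
    "GWO": [0.83, 0.56, 0.37],
}

# (mean, sample std) per method, rounded to two decimals as in Table 8.
summary = {m: (round(statistics.mean(v), 2), round(statistics.stdev(v), 2))
           for m, v in slr.items()}
```

The result is (0.66, 0.23), (0.64, 0.24), (0.62, 0.26), and (0.59, 0.23) for the four methods, matching the table.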
Table 9. Summary of Shipboard Generator Power Parameters Under the New Topology.

| Generator | Rated Power (MW) |
|---|---|
| G1 | 8 |
| G2 | 6 |
| G3 | 6 |
| G4 | 8 |
Table 10. Summary of Shipboard Load Parameters Under the New Topology [48].

| Load No. | Power (MW) | Load Level | Load No. | Power (MW) | Load Level |
|---|---|---|---|---|---|
| 1 | 1.5 | 1 | 7 | 1.5 | 2 |
| 2 | 1 | 1 | 8 | 1.5 | 2 |
| 3 | 4 | 2 | 9 | 1 | 1 |
| 4 | 2.5 | 2 | 10 | 1.5 | 1 |
| 5 | 2 | 1 | 11 | 1 | 1 |
| 6 | 2.5 | 1 | 12 | 3 | 1 |
Table 11. Optimal Reconfiguration Scheme Under the New Topology.

| Case | Point of Fault | ORR | SLR | PUR | PUR Variance |
|---|---|---|---|---|---|
| Case 1 | V14, V19 | 110000011111 | 0.46 | 0.92/0.83 | 0.0017 |
| Case 2 | V15, V18, x24 | 110110101111 | 0.65 | 0.94/0.81 | 0.0042 |
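If the "PUR Variance" column is read as the population variance of the two zonal power-utilization ratios listed in the PUR column (an assumption, not stated explicitly in the table), Case 2's figure is reproduced exactly, while Case 1's 0.0017 versus var([0.92, 0.83]) ≈ 0.0020 is consistent only up to rounding of the displayed PURs. A quick check under that assumption:

```python
import statistics

# Zonal PURs for Case 2 from Table 11; population variance (n denominator).
pur_case2 = [0.94, 0.81]
var2 = statistics.pvariance(pur_case2)  # 0.065**2 = 0.004225, i.e. ~0.0042
```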
Share and Cite

MDPI and ACS Style

Yao, G.; Li, X.; Saim, A.; Ait-Ahmed, M.; Benbouzid, M. Fault Reconfiguration of Shipboard MVDC Power Systems Based on Multi-Agent Reinforcement Learning. J. Mar. Sci. Eng. 2026, 14, 278. https://doi.org/10.3390/jmse14030278

