2.1. Shipboard MVDC Power Systems: Structure and Directed-Graph Representation
According to IEEE standards, shipboard MVDC power systems typically adopt radial or ring-type zonal architectures [32]. Because the ring-type zonal architecture offers higher supply reliability and a more efficient use of ship space, the shipboard MVDC power system studied in this paper adopts this configuration, as shown in Figure 1. The system is supplied by four diesel generators. The generated AC power is converted to medium-voltage DC by rectifiers and fed to the main DC bus, which then supplies the shipboard loads through the electrical distribution network. The primary loads in Figure 1 have dual supply paths and can be served by either the port or the starboard bus. Secondary and tertiary loads have only a single supply path. When generation capacity is insufficient, tertiary and secondary loads can be shed to ensure that primary loads remain energized.
To simplify the analysis, this paper assumes that after a fault occurs in the shipboard MVDC power system shown in Figure 1, the circuit breakers adjacent to the faulted area trip promptly to isolate the faulted zone.
From a graph-theoretic perspective, busbars, generators, and loads can be modeled as nodes, and power lines as directed edges. Accordingly, the topology of a shipboard MVDC power system can be represented as a directed graph G = (V, x), where V denotes the set of nodes (system components) and x denotes the set of edges (power lines) connecting these nodes. Thus, the shipboard MVDC power system shown in Figure 1 can be represented as a directed graph, as illustrated in Figure 2. In Figure 2, light-blue nodes represent medium-voltage DC buses, green nodes represent shipboard loads, and dark-blue nodes represent generators; all of these belong to the node set V. Solid lines denote energized lines in normal operation, whereas dashed lines denote de-energized backup lines; all of these belong to the edge set x. The light-blue solid line represents the main DC bus line, which serves as the primary path for power transmission. The green line denotes the load line, which controls the connection between loads and the bus. The dark-blue line indicates the generator line, which controls the connection between generators and the port/starboard buses. The orange solid line represents the zone tie line, which connects different zones to regulate power transfer.
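For the directed-graph representation of the shipboard MVDC power system shown in Figure 2, an adjacency matrix $\mathbf{A} = [a_{ij}]_{n \times n}$ is used to describe the adjacency relationships between nodes, where n denotes the total number of nodes. Each matrix entry $a_{ij}$ takes a value of 0 or 1, indicating the connection state between node i and node j:
$$a_{ij} = \begin{cases} 1, & \text{node } i \text{ is connected to node } j \text{ by an energized line} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$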
In the shipboard MVDC power system, normally open standby lines (dashed) represent redundant branches in the network topology. In normal operation, these branches remain open. Therefore, the corresponding entries in the adjacency matrix are set to 0, reflecting the actual power transfer paths of the system.
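The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `build_adjacency`, the `(i, j, energized)` line format, the 0-based node indices, and the four-node example are all assumptions made here for clarity.

```python
def build_adjacency(n, lines):
    """Build an n x n 0/1 adjacency matrix from directed lines.

    `lines` is an iterable of (i, j, energized) tuples (hypothetical format).
    Normally open standby lines (energized=False) are left at 0.
    """
    a = [[0] * n for _ in range(n)]
    for i, j, energized in lines:
        if energized:
            a[i][j] = 1
    return a


# Hypothetical fragment: generator 0 -> bus 1 -> load 2,
# plus a normally open backup line from bus 3 to load 2.
lines = [(0, 1, True), (1, 2, True), (3, 2, False)]
adjacency = build_adjacency(4, lines)
```

When a backup line is closed during reconfiguration, its entry would simply be set to 1, so the matrix always reflects the active power transfer paths.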
2.3. Dec-POMDP Modeling for Fault Reconfiguration in Shipboard MVDC Power Systems
When faults occur in a shipboard MVDC power system, limitations in communication reliability and control hierarchy make it difficult to rely on a single centralized controller for real-time decision-making [34]. Therefore, the fault reconfiguration process of a shipboard MVDC power system involves multi-module cooperative control. Following a fault, each control unit can access only local information and must perform decentralized cooperative decision-making under partially observable conditions, with system restoration effectiveness serving as the unified optimization objective. To this end, this paper formulates the fault reconfiguration process as a cooperative multi-agent deep reinforcement learning (MADRL) problem and models it as a decentralized partially observable Markov decision process (Dec-POMDP). Dec-POMDP is a classical framework for describing decentralized, partially observable cooperative decision-making problems [35]. Within this framework, each agent selects actions based on local observations. The joint action of all agents acts upon the system and drives updates to the system state. The environment then provides feedback in the form of a shared reward signal to guide iterative policy optimization. Dec-POMDP can be represented as the following tuple [36]:
$$\langle S, A, N, T, O, R, \gamma \rangle \tag{18}$$
where S denotes the global state space of the environment; $A$ represents the action space of the agents; N indicates the number of agents; T is the transition function, expressed as $T(S' \mid S, \mathbf{a}_t)$, representing the probability of transitioning from state S to state S' at time step t under the joint action $\mathbf{a}_t$ of all agents; O is the observation function for each agent, expressed as $O(o_t^i \mid S, a_t^i)$, denoting the probability that agent i receives the local observation $o_t^i$ given state S and its own action $a_t^i$; R is the shared reward function, expressed as $R(S, \mathbf{a}_t)$, used to evaluate the quality of the joint action $\mathbf{a}_t$ taken in state S; and $\gamma$ is the discount factor, which balances immediate and future rewards.
The objective of Dec-POMDP is to obtain an optimal joint policy $\pi^*$ that maximizes the expected cumulative reward. The expected cumulative reward is expressed as follows [36]:
$$J(\theta) = \mathbb{E}\left[\sum_{t} \gamma^{t} R(S_t, \mathbf{a}_t)\right], \quad a_t^i \sim \pi_\theta^i(\cdot \mid o_t^i) \tag{19}$$
where $J(\theta)$ denotes the expected cumulative discounted reward, which measures the agents’ long-term expected total reward under the policy parameters θ, and $a_t^i$ represents the action chosen by agent i at time t based on its local observation $o_t^i$ through its policy $\pi_\theta^i$.
To model the uncertainty of faults, this paper randomly selects faults from a predefined set of fault types in the shipboard MVDC power system as the initial conditions for each round (episode), thereby generating different initial states and reconfiguration tasks. Within the Dec-POMDP framework, agents make decisions step by step based on local observations within a finite time horizon. To implement this reconfiguration process, this paper constructs a multi-agent system comprising N = 5 agents based on the aforementioned Dec-POMDP definition. The definitions and settings for each agent and its key components are provided below.
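As a small numerical illustration of the expected cumulative discounted reward discussed above, the sketch below computes the discounted return of one round from its shared per-step rewards and averages it over several rounds as a Monte Carlo estimate. The function names and the sample reward values are hypothetical and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one round, computed backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(rounds, gamma):
    """Monte Carlo estimate: mean discounted return over sampled rounds."""
    returns = [discounted_return(rewards, gamma) for rewards in rounds]
    return sum(returns) / len(returns)


# Hypothetical shared rewards from two rounds with different initial faults.
rounds = [[1.0, 1.0, 1.0], [0.0, 2.0]]
estimate = estimate_objective(rounds, gamma=0.5)
```

With a discount factor of 0.5, the first round yields 1 + 0.5 + 0.25 = 1.75 and the second yields 0 + 1 = 1.0, so the estimate is their mean.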
(1) Agent: This paper configures five agents, denoted as Agent n (n = 1, …, 5). Agent 1, Agent 2, and Agent 3 are the primary, secondary, and tertiary load agents, respectively. They control the interconnection switches on the busbar connection lines for each load level, enabling power supply switching for the corresponding loads. Agent 4 is the zonal tie switch agent, responsible for opening and closing the zonal tie switches. Agent 5 is the generator agent, which controls the generator branch tie switches to switch generator supply paths. This decomposition follows common shipboard load management practice, where loads are classified by criticality to support selective shedding and restoration. Tie switch operations change network connectivity and thus determine feasible reconfiguration topologies and power flow paths. Generator branch switching determines source-side connectivity and supply routing under capacity and operating constraints. Separating these decision types into dedicated agents decouples their constraints and action spaces, which reduces decision conflicts among the three decision types.
(2) State: The state space should contain sufficient information to support effective learning and decision optimization by the agents. The state in this paper is the global state of the system, representing the overall operational status of the shipboard MVDC power system during reconfiguration. Based on the definition of the state space S in Equation (18), this paper constructs the state from node voltages, branch currents, line states, generator output power, load power, and fault line information, which can be expressed as follows:
$$S_t = \{U_t, I_t, L_t, P_t^{G}, P_t^{L}, F_t\} \tag{20}$$
where $U_t$, $I_t$, $L_t$, $P_t^{G}$, $P_t^{L}$, and $F_t$ denote the node voltage magnitude, branch current, line status, generator output power, load power, and the fault line information, respectively.
(3) Action: Agents control the on/off state of tie switches to alter the operational status of the corresponding lines, thereby reconfiguring power paths and adjusting power flow within the grid. Each round corresponds to a complete fault reconfiguration and restoration process. At each discrete time step t, the system state is updated, and agents select and execute control actions based on their current observations until system restoration is complete.
Agent 1 is responsible for selecting the power supply side for primary loads. As shown in Figure 1, the system contains K = 6 primary loads. For load k, the connection status of its two lines to the port and starboard buses is represented by the binary variables $l_{k,t}^{P}, l_{k,t}^{S} \in \{0, 1\}$, where 1 indicates on and 0 indicates off. The action is defined as follows:
$$a_t^{1} = (k, d_k), \quad k \in \{1, \dots, K\}, \; d_k \in \{0, 1\} \tag{21}$$
where $d_k$ = 0 or 1 indicates that primary load k selects the port or the starboard bus, respectively, as its supply path, resulting in an action space size of 6 × 2 = 12. After this action, the line status of the load at the next time step is updated as follows:
$$\left(l_{k,t+1}^{P},\, l_{k,t+1}^{S}\right) = \begin{cases} (1, 0), & d_k = 0 \\ (0, 1), & d_k = 1 \end{cases} \tag{22}$$
Agent 2, Agent 3, and Agent 4 are responsible for controlling the on/off status of secondary loads, tertiary loads, and zone tie switches, respectively. As shown in Figure 1, the secondary loads, tertiary loads, and zone tie switches each comprise H = 6 objects. At each time step t in the round, each agent performs on/off control on only one of the six objects under its responsibility. For object h (load or zonal tie switch) controlled by Agent n, a binary variable $l_{h,t}^{n} \in \{0, 1\}$ represents the on/off state of the corresponding line at time t, where 1 indicates on and 0 indicates off. The actions of these three agents are defined as follows:
$$a_t^{n} = (h, u_h), \quad h \in \{1, \dots, H\}, \; u_h \in \{0, 1\}, \; n \in \{2, 3, 4\} \tag{23}$$
where $u_h$ = 1 (or 0) indicates setting the corresponding line to on (or off). Thus, each agent’s action space size is 6 × 2 = 12. After this action, the state of the selected object’s line at the next time step is given by the following equation:
$$l_{h,t+1}^{n} = u_h \tag{24}$$
At time step t, Agent 5 decides the power supply direction for a single generator. The system contains F = 4 generators. Each generator f has two lines connected to the port and starboard buses, respectively. The connection status of these lines is represented by the binary variables $g_{f,t}^{P}, g_{f,t}^{S} \in \{0, 1\}$, where 1 indicates on and 0 indicates off. Only the combinations {(1,0), (0,1), (1,1)} are permitted, corresponding to supplying the port side, the starboard side, or both sides simultaneously. The action is defined as follows:
$$a_t^{5} = (f, m_f), \quad f \in \{1, \dots, F\}, \; m_f \in \{0, 1, 2\} \tag{25}$$
where $m_f$ = 0, 1, and 2 correspond to port-side, starboard-side, and both-side power supply, respectively, resulting in an action space size of 4 × 3 = 12. After this action, the line status of this generator at the next time step is updated according to the following equation:
$$\left(g_{f,t+1}^{P},\, g_{f,t+1}^{S}\right) = \begin{cases} (1, 0), & m_f = 0 \\ (0, 1), & m_f = 1 \\ (1, 1), & m_f = 2 \end{cases} \tag{26}$$
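The three action types described above each have 12 discrete choices, so a policy network can emit a single integer per agent. The sketch below shows one way to decode such an integer into an (object, choice) pair and apply the corresponding line-status update. The flat-index encoding, the function names, and the list-based status containers are assumptions made for illustration and are not specified in the paper.

```python
def decode_action(index, n_objects, n_choices):
    """Map a flat action index to (object index, choice), both 0-based."""
    if not 0 <= index < n_objects * n_choices:
        raise ValueError("action index out of range")
    return divmod(index, n_choices)


# Permitted (port, starboard) line states for a generator.
GENERATOR_MODES = {0: (1, 0), 1: (0, 1), 2: (1, 1)}


def apply_primary_load_action(line_status, index):
    """Agent 1: pick one of 6 primary loads and its supply side."""
    k, side = decode_action(index, 6, 2)
    line_status[k] = (1, 0) if side == 0 else (0, 1)
    return line_status


def apply_switch_action(line_status, index):
    """Agents 2-4: set one of 6 lines to on (1) or off (0)."""
    h, state = decode_action(index, 6, 2)
    line_status[h] = state
    return line_status


def apply_generator_action(line_status, index):
    """Agent 5: pick one of 4 generators and its supply mode."""
    f, mode = decode_action(index, 4, 3)
    line_status[f] = GENERATOR_MODES[mode]
    return line_status
```

For example, index 7 for Agent 1 decodes to load 3 (0-based) supplied from the starboard bus, and index 11 for Agent 5 decodes to generator 3 supplying both sides. Note that the generator mode table never produces (0, 0), which matches the restriction on permitted combinations.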
(4) Observation: The observation space characterizes the local operational information accessible to each agent during system recovery and serves as the basis on which agents formulate control decisions. This paper constructs a corresponding observation set for each agent. For Agent 1 to Agent 3, the observations include the real-time power of each of the six controlled loads, the states of the lines connecting the controlled loads to the port and starboard buses, and the voltage at the bus node where each load is located. Their observations can be expressed, respectively, as follows:
Agent 4’s observations include the states of the lines controlled by its tie switches; Agent 5’s observations include the power output of its controlled generator, the states of the lines connecting this generator to the port and starboard buses, and the voltage at the bus node where the generator is located. Their observations are represented as follows:
(5) Reward: Section 2.2 presents the optimization objectives and constraints for fault reconfiguration in shipboard MVDC power systems. Based on these objectives and constraints, the reward function must guide agents toward the optimization goals during fault reconfiguration while ensuring system operational safety.
During the interaction between agents and the environment, the environment updates its system state based on the agents’ actions and evaluates the effectiveness of their decisions using a pre-designed reward function. The reward function constructed in this paper comprehensively considers the optimization objectives given by Equations (2)–(4) and the operational constraints specified by Equations (5)–(17). It consists of three components: a maximum load power supply term, a voltage limit violation penalty term, and a load balancing penalty term. These, respectively, characterize the power supply capacity, voltage operational safety, and load distribution balance of the system.
(a) Maximum Load Power Supply: The power of all online loads is calculated at each time step. The weighting factors w1, w2, and w3 are applied to primary, secondary, and tertiary loads, respectively, and the weighted sum gives the reward value for that time step. The mathematical expression is as follows:
(b) Voltage Limit Violation Penalty: To ensure the safety and stability of system operation, a node-voltage penalty term is introduced to penalize voltages outside the safe range. The voltage limit follows IEEE Std 1709-2018, which specifies a steady-state voltage tolerance of ±10% relative to the nominal value. For a conservative simulation setting, this paper uses a narrower range of 0.95–1.05 p.u. (±5%). The mathematical expression is as follows:
(c) Load Balancing Penalty: During the reconfiguration of a shipboard MVDC power system, topology changes may partition the system into multiple zones. The previously defined PUR, whose definition is provided in Equation (4), serves as the metric for balancing load across these zones. The PUR of zone k at time t is expressed as follows:
The closer the utilization rate of each zone is to 100%, the more balanced the resource allocation becomes. Therefore, the variance in utilization rates across zones is used to measure the degree of load balancing:
Based on this definition, the load balancing reward can be expressed by the following equation:
Based on the above definitions, the total reward is written as a weighted sum of the three terms.
The normalized terms are given by:
Here, is set to the restoration reward of a representative secondary load to capture the typical scale of load restoration. The voltage reference is set to the penalty magnitude when the voltage-violation level reaches the minimum level of concern p.u. The reference imbalance level is obtained from simulation statistics, and is used. Accordingly, is set to the penalty magnitude at . Based on these normalized terms, the weights are chosen as , and this ratio is selected to impose a clear priority among the three objectives. Load restoration is treated as the main performance target during reconfiguration. Voltage security is a strict operational requirement in shipboard MVDC systems, since voltage limit violations may trigger cascading effects or severe failures. Therefore, the voltage term is assigned a substantial weight to strongly discourage violations. The balancing term is kept smaller so that it mainly guides resource allocation among feasible actions without dominating the reward.
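To make the structure of the three reward components concrete, the following sketch computes a weighted online-load term, a voltage violation penalty for the 0.95–1.05 p.u. band, and a variance-based balancing penalty, then combines them after normalization. Several details are assumptions made here and are not confirmed by the text: the linear form of the voltage penalty, the use of population variance, the sign convention (reward minus penalties), and every numerical weight and reference value in the example.

```python
from statistics import pvariance


def load_reward(loads, weights):
    """Weighted sum of online load power.

    `loads` holds (priority, power, online) tuples (hypothetical format);
    `weights` maps priority 1/2/3 to w1/w2/w3.
    """
    return sum(weights[p] * power for p, power, online in loads if online)


def voltage_penalty(voltages, v_min=0.95, v_max=1.05):
    """Assumed linear penalty: total excursion outside [v_min, v_max] (p.u.)."""
    return sum(max(0.0, v_min - v, v - v_max) for v in voltages)


def balance_penalty(pur_by_zone):
    """Variance of the zone utilization rates (population variance assumed)."""
    return pvariance(pur_by_zone)


def total_reward(r_load, p_volt, p_bal, refs, lambdas):
    """Weighted sum of normalized terms; penalties enter with a minus sign."""
    return (lambdas[0] * r_load / refs[0]
            - lambdas[1] * p_volt / refs[1]
            - lambdas[2] * p_bal / refs[2])


# Hypothetical values for illustration only.
weights = {1: 3.0, 2: 2.0, 3: 1.0}
loads = [(1, 2.0, True), (2, 1.0, True), (3, 4.0, False)]
r = total_reward(load_reward(loads, weights),
                 voltage_penalty([1.0, 1.0]),
                 balance_penalty([0.5, 1.0]),
                 refs=(4.0, 1.0, 0.25),
                 lambdas=(1.0, 1.0, 0.5))
```

Dividing each term by a reference magnitude before weighting keeps the three terms on comparable scales, so the weights express priority rather than compensating for unit differences.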
Table 1 summarizes the key elements of the Dec-POMDP model, including the agents, their actions, local observations, and the shared reward.
Based on the aforementioned Dec-POMDP definition and agent configuration, Figure 3 illustrates the interaction process between the agents and the shipboard MVDC power system. At time step t, each agent i obtains its local observation $o_t^i$ according to Equations (27)–(31) and selects action $a_t^i$ via its policy π. After the actions are executed, the environment updates its state according to Equation (20), returns new observations, and generates the instantaneous reward $r_t$ using Equation (37). This paper adopts maximization of the cumulative reward as the optimization objective. The cumulative reward is obtained by discounting and summing the rewards from each time step according to Equation (19), which drives the iterative update of the policy.
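The interaction process described above can be sketched as a simple loop over one round. The `ToyEnv` class below is a hypothetical stand-in for the shipboard MVDC environment: its observations and reward are placeholders, not the paper's Equations (27)–(31) and (37). It exists only to make the loop runnable.

```python
class ToyEnv:
    """Hypothetical stand-in environment with a shared reward."""

    def __init__(self, n_agents=5, horizon=3):
        self.n_agents = n_agents
        self.horizon = horizon
        self.t = 0

    def reset(self):
        """Start a new round and return one local observation per agent."""
        self.t = 0
        return [0] * self.n_agents

    def step(self, actions):
        """Apply the joint action; return observations, reward, done flag."""
        self.t += 1
        observations = list(actions)       # placeholder local observations
        reward = float(sum(actions))       # placeholder shared reward
        done = self.t >= self.horizon
        return observations, reward, done


def run_round(env, policies):
    """Each agent acts on its own observation until the round ends."""
    observations = env.reset()
    rewards, done = [], False
    while not done:
        actions = [pi(o) for pi, o in zip(policies, observations)]
        observations, reward, done = env.step(actions)
        rewards.append(reward)
    return rewards


# Five trivial policies that always choose action 1.
rewards = run_round(ToyEnv(), [lambda o: 1] * 5)
```

Each policy sees only its own entry of the observation list, which mirrors the decentralized execution setting, while the single scalar reward is shared by all agents.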