1. Introduction
Renewable generation is playing an increasingly important role in the development and utilization of marine resources [1,2]. Specifically, the proportion of renewable generation units in offshore island microgrids is higher than that in land-based microgrids. These renewable generation units are integrated into the grid via inverters that convert direct current (DC) into alternating current (AC) [3]. Voltage serves as a crucial parameter in island AC microgrid operation [4]. Proper reactive power allocation affects not only power balance but also voltage regulation and line loss optimization. Since island microgrids are not connected to the main grid, voltage and reactive power regulation depend entirely on internal coordinated control [5]. Additionally, island microgrids operate under harsh marine climatic conditions, which pose significant challenges to their operational security. Given their important role in promoting the development of marine resources, voltage regulation and reactive power sharing have become key issues that require extensive research [6].
Voltage regulation and reactive power sharing have been extensively investigated in previous research. The average voltage regulation method in [7] ensures accurate reactive power sharing but may violate voltage safety constraints under load fluctuations. The weighted coefficient method in [8] maintains voltage within safe limits but relies on empirical tuning, limiting scalability. The optimization-based approach in [9] achieves precise control but is computationally intensive and hard to implement in real time. These limitations hinder the practical deployment of these methods in island microgrids.
To address the conflict between voltage regulation and reactive power sharing in island AC microgrids, containment control has gradually attracted increasing attention. In [10], the use of containment control was first proposed to balance voltage regulation and reactive power sharing in microgrids, laying a theoretical foundation for subsequent research in this field. The application of containment control was further extended in [11] to address voltage regulation issues among multiple interconnected microgrids, which significantly improved the performance and stability of coordinated multi-microgrid operation. In [12], the discussion focuses on a containment control-based strategy for voltage regulation in microgrids facing communication and sensor failures. However, in the above methods, the leaders for containment control are typically predetermined. When the bus associated with the upper-bound leader experiences a heavy load, its voltage may drop significantly, which contradicts the assumption that the upper leader should always maintain the highest voltage. This situation leads to ineffective reactive power transfer among buses. The fixed leader configuration lacks flexibility and reduces the adaptability of the island microgrid under dynamic operating conditions.
In island microgrids that have already been put into operation, it is difficult to accurately obtain model parameters such as resistance and inductance due to the limited precision of measurement devices and the influence of operating conditions (such as temperature variations). In contrast, data-driven approaches that do not rely on explicit model parameters can directly utilize operational data from the microgrid for controller design. In [13], a deep learning-based secondary controller for microgrids is designed using historical data. In [14], Koopman operator theory enables voltage control based on input–output data. In [15], a data-driven distributed predictive control method achieves voltage restoration and current sharing via an incremental linear model. In [16], least squares and Gaussian process regression are used to learn system sensitivity and estimate modeling errors, ensuring optimal and safe microgrid control. However, the method in [13] requires processing a large quantity of offline data, while the approach in [15] has high computational complexity and slow convergence of control performance, making it difficult to satisfy the real-time requirements of microgrid control. The methods in [14,16] rely heavily on prior knowledge and historical data. These issues limit the widespread application of data-driven methods in practical island microgrids.
Based on the above discussion, this paper proposes an island microgrid voltage regulation and reactive-power-sharing control strategy that combines a dynamic leader election algorithm with a model-free reinforcement learning algorithm. The proposed strategy utilizes a leader election algorithm to dynamically adjust the leader roles according to the load conditions of each bus in the island microgrid: the distributed generation (DG) corresponding to the bus with a higher load is set as the lower-bound leader, while the DG associated with the bus with a lighter load is set as the upper-bound leader. In this way, flexible reactive power flow among buses is promoted, enabling precise reactive power sharing. Meanwhile, by designing value functions for voltage and reactive power errors and employing a model-free reinforcement learning algorithm, the controller is designed based solely on island microgrid operational data without requiring any model information. Furthermore, this paper theoretically proves the convergence of the leader election algorithm and the optimality of the policy iteration algorithm in model-free reinforcement learning. Lastly, the proposed strategy is validated through a series of simulation experiments. In the experiments, the effectiveness of the proposed method is validated in three distinct case studies, which confirm that it restores the voltage of island AC microgrids to the reference range set by containment control and accomplishes accurate reactive power sharing. The main contributions of this research are summarized below:
To address the limitations of containment control methods based on fixed leaders, as proposed in [10,11], which struggle with complex scenarios like sudden large load changes, this paper introduces a novel dynamic leader election (DLE) algorithm. Unlike the static nature of fixed-leader approaches, our DLE mechanism is based on bus voltage estimation, allowing each DG to dynamically select the leader according to the relative magnitude of the estimated voltages. This adaptive capability enables accurate reactive power sharing even under sudden load changes or large load fluctuations, significantly enhancing the microgrid’s flexibility.
To overcome the challenges of model-based controller design, as highlighted in [17,18], where obtaining practical parameters like resistance and inductance is difficult, this paper proposes a data-driven online reinforcement learning approach. In contrast to model-based methods that are sensitive to parameter uncertainties and measurement errors, our algorithm does not require extensive offline data processing. The control policy is iteratively optimized online by minimizing a value function, enabling accurate reactive power sharing and effective voltage control in the microgrid without relying on a precise system model.
The structure of this paper is as follows: The necessary background and preliminaries are outlined in Section 2. Section 3 proves the convergence of policy iteration, proposes a dynamic leader election algorithm, and designs a model-free reinforcement learning algorithm. Section 4 validates the effectiveness of the proposed methods through numerical experiments. Finally, Section 5 summarizes the main contributions of this paper.
2. Preliminaries and Problem Formulation
This section presents the graph theory, island microgrid modeling, analysis of reactive power and voltage coupling, and the formulation of performance indices for containment control.
2.1. Graph Theory
This paper considers an island AC microgrid modeled as a multi-agent system (MAS), comprising N follower agents and M virtual leader agents. The MAS is represented by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, 2, \ldots, N\}$ denotes the node set, while $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ corresponds to the collection of edges. Two nodes $i$ and $j$ are considered neighbors if $(i, j) \in \mathcal{E}$. For the graph $\mathcal{G}$, $A = [a_{ij}] \in \mathbb{R}^{N \times N}$ is the adjacency matrix, where $a_{ij} > 0$ if $(i, j) \in \mathcal{E}$; otherwise, $a_{ij} = 0$. The degree matrix is $D = \mathrm{diag}\{d_1, \ldots, d_N\}$, where $d_i = \sum_{j \in N_i} a_{ij}$ is the degree of the $i$th node in the graph, and $N_i$ represents the set of neighbors of node $i$. The Laplacian matrix $L$ is then defined by $L = D - A$. Matrix $G_r = \mathrm{diag}\{g_{1r}, \ldots, g_{Nr}\}$ ($r = 1, \ldots, M$) is the pinning gain matrix associated with the $r$th virtual leader, where $g_{ir} > 0$ if the $r$th virtual leader can communicate with the $i$th follower; otherwise, $g_{ir} = 0$.
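As a point of reference, the sketch below (Python/NumPy, not part of the original design) builds the Laplacian and pinning matrices for an assumed four-DG ring communication topology; the adjacency entries and the choice of pinned followers are hypothetical.

```python
import numpy as np

# Assumed 4-DG ring communication topology: a_ij > 0 iff DG i and DG j exchange data
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix: d_i = sum_j a_ij
L = D - A                    # graph Laplacian L = D - A

# Pinning gain matrices G_r (hypothetical): the upper-bound leader (r = 1) pins DG1,
# the lower-bound leader (r = 2) pins DG4; g_ir > 0 only for pinned followers.
G1 = np.diag([1.0, 0.0, 0.0, 0.0])
G2 = np.diag([0.0, 0.0, 0.0, 1.0])

print(L)
```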
2.2. Model Descriptions of Island AC Microgrids
Assume that all DGs are interfaced with the island microgrid through voltage source inverters (VSIs), each equipped with an output LC filter, as illustrated in Figure 1. By applying feedback linearization [17], the following relationship is established:
$$\dot{y}_i = L_{F_i} h_i(x_i), \qquad \ddot{y}_i = L_{F_i}^{2} h_i(x_i) + L_{G_i} L_{F_i} h_i(x_i)\, u_{0i} \triangleq u_i, \tag{1}$$
where $u_{0i}$ is the original control input of the $i$th VSI. For the sake of brevity, while the complete derivation is detailed in [11], the key terms are briefly defined here. $x_i$ is the state vector of the $i$th DG, which includes filter currents and capacitor voltages in the dq-frame. $y_i = h_i(x_i)$ is the output function, defined as the bus voltage $v_{bi}$. $F_i$ and $G_i$ represent the drift and input vector fields of the nonlinear system, respectively. $L_{F_i} h_i$ and $L_{F_i}^{2} h_i$ are the first- and second-order Lie derivatives of the output function with respect to the system dynamics. $u_i$ is the new, linearized control input. According to [11], $y_i$ is used to denote the bus voltage $v_{bi}$. Since the control of microgrids is typically implemented within a digital control framework, it is necessary to discretize the aforementioned continuous-time equations. After discretization, the variable $y_i$ in Equation (1) can be reformulated as follows:
$$z_i(k+1) = A_d\, z_i(k) + B_d\, u_i(k), \tag{2}$$
where the new control input is $u_i(k)$. The state variable is defined by $z_i(k) = [\,y_i(k) \;\; \dot{y}_i(k)\,]^{\mathrm T}$. The discrete-time system matrices $A_d$ and $B_d$ are derived from $A$ and $B$ (as described in [11]) using the zero-order hold method with sampling interval $T_s$, where $A_d = e^{A T_s}$ and $B_d = \int_0^{T_s} e^{A \tau}\,\mathrm{d}\tau\, B$ [19]. Thus, $A_d = \begin{bmatrix} 1 & T_s \\ 0 & 1 \end{bmatrix}$ and $B_d = \begin{bmatrix} T_s^{2}/2 \\ T_s \end{bmatrix}$.
The linearized system uses the state vector $z_i(k)$ to describe the voltage dynamics, where the bus voltage $y_i(k)$ and its derivative $\dot{y}_i(k)$ are captured.
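For illustration, a minimal sketch of the zero-order-hold discretization described above is given below, assuming the double-integrator form of $A$ and $B$ obtained from feedback linearization and a hypothetical sampling interval; SciPy's `cont2discrete` is used only to cross-check the closed-form $A_d$ and $B_d$.

```python
import numpy as np
from scipy.signal import cont2discrete

Ts = 1e-4                                   # hypothetical sampling interval (s)
A = np.array([[0.0, 1.0], [0.0, 0.0]])      # double integrator after feedback linearization
B = np.array([[0.0], [1.0]])

# Zero-order hold: Ad = exp(A*Ts), Bd = (integral_0^Ts exp(A*tau) dtau) * B
Ad, Bd, *_ = cont2discrete((A, B, np.eye(2), np.zeros((2, 1))), Ts, method='zoh')

# Cross-check against the closed forms Ad = [[1, Ts], [0, 1]], Bd = [[Ts^2/2], [Ts]]
assert np.allclose(Ad, [[1.0, Ts], [0.0, 1.0]])
assert np.allclose(Bd, [[Ts**2 / 2], [Ts]])
```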
2.3. Reactive Power Sharing and Voltage Regulation
Under islanded conditions, each DG applies the conventional $Q$–$V$ droop control [20], i.e., $V_i^{*} = V_n - n_{Qi} Q_i$, where $V_i^{*}$ and $V_n$ are the voltage and voltage magnitude references, $n_{Qi}$ is the droop coefficient, and $Q_i$ is the reactive power output of the $i$th DG. The objective of reactive power sharing is [21]: $n_{Q1} Q_1 = n_{Q2} Q_2 = \cdots = n_{QN} Q_N$.
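A small numerical illustration of the $Q$–$V$ droop law and the sharing objective follows; the droop coefficients, reactive power outputs, and nominal voltage are hypothetical values chosen so that $n_{Qi} Q_i$ is equal across DGs.

```python
V_n = 311.0                               # nominal voltage magnitude reference (V), as in Section 4
n_Q = [1.0e-3, 2.0e-3, 1.0e-3, 2.0e-3]    # hypothetical droop coefficients n_Qi (V/var)
Q   = [800.0, 400.0, 800.0, 400.0]        # hypothetical reactive power outputs Q_i (var)

# Q–V droop: V_i* = V_n - n_Qi * Q_i
V_ref = [V_n - n * q for n, q in zip(n_Q, Q)]

# Accurate sharing requires n_Q1*Q1 = n_Q2*Q2 = ... = n_QN*QN
shares = [n * q for n, q in zip(n_Q, Q)]
print(V_ref, shares)   # here every n_Qi*Q_i equals 0.8, so sharing is proportional
```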
Following the conventional Kron reduction based on steady-state parameters [22], the reduced bus admittance matrix $Y = [Y_{ij}]$ is obtained. The reactive power at bus $i$ is given by [9]:
$$Q_i = \sum_{j \in \mathcal{N}_i} V_i V_j \left( G_{ij} \sin\theta_{ij} - B_{ij} \cos\theta_{ij} \right), \tag{3}$$
where $G_{ij}$ and $B_{ij}$ are the real and imaginary parts of $Y_{ij}$ between buses $i$ and $j$ ($Y_{ij} = G_{ij} + \mathrm{j} B_{ij}$), $\theta_{ij}$ is the phase angle difference, and $\mathcal{N}_i$ is the set of buses connected to $i$ (including $i$ itself). Assuming small power angles ($\sin\theta_{ij} \approx \theta_{ij}$, $\cos\theta_{ij} \approx 1$) [23] and predominantly inductive feeder impedance ($G_{ij} \approx 0$) [24], (3) simplifies to $Q_i \approx -\sum_{j \in \mathcal{N}_i} B_{ij} V_i V_j$. Combining this simplified expression with (2), the discrete-time dynamics of $Q_i$ are obtained as Equation (4).
Accurate reactive power sharing is difficult due to the coupling between voltage regulation and reactive power [10]. Tight voltage control limits reactive power exchange and leads to sharing imbalance. To overcome this, a containment control strategy is used to maintain voltage within set bounds. The boundary dynamics of the virtual leaders are given in (5), where $\overline{v}$ and $\underline{v}$ denote the upper and lower voltage reference limits. According to MAS theory, the neighborhood containment error $e_i(k)$ of the $i$th DG is given by:
$$e_i(k) = \sum_{j \in N_i} a_{ij}\left(z_i(k) - z_j(k)\right) + \sum_{r=1}^{M} g_{ir}\left(z_i(k) - z_r(k)\right), \tag{6}$$
where $z_r(k)$ denotes the state of the $r$th virtual leader.
Based on (2), (5), and (6), the dynamic equation satisfied by the containment voltage error $e_i(k)$ can be derived, as given in (7). The reactive-power-sharing error quantifies deviations in the reactive power contribution of DGs and is defined by
$$e_{Qi}(k) = \sum_{j \in N_i} a_{ij}\left(n_{Qi} Q_i(k) - n_{Qj} Q_j(k)\right). \tag{8}$$
According to (4), the dynamic equation of the reactive-power-sharing error is given in (9).
By minimizing these errors, the proposed control strategy achieves both containment-based voltage regulation and accurate reactive power sharing.
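The following sketch illustrates how each DG could evaluate its local containment error (6) and reactive-power-sharing error from neighbor data; the topology, pinning gains, voltage snapshot, and ±1% bounds are hypothetical, and scalar bus voltages are used in place of the full state vectors for simplicity.

```python
import numpy as np

def containment_error(i, v, A, G, v_leaders):
    """Neighborhood containment error of DG i, cf. (6):
    sum_j a_ij (v_i - v_j) + sum_r g_ir (v_i - v_r)."""
    e = sum(A[i, j] * (v[i] - v[j]) for j in range(len(v)))
    e += sum(G[r][i, i] * (v[i] - v_leaders[r]) for r in range(len(v_leaders)))
    return e

def sharing_error(i, nQ, Q, A):
    """Reactive-power-sharing error of DG i: sum_j a_ij (n_Qi*Q_i - n_Qj*Q_j)."""
    return sum(A[i, j] * (nQ[i] * Q[i] - nQ[j] * Q[j]) for j in range(len(Q)))

# Hypothetical snapshot of bus voltages (V) and reactive powers (var)
v  = np.array([312.0, 309.5, 310.8, 311.5])
Q  = np.array([820.0, 390.0, 805.0, 410.0])
nQ = np.array([1e-3, 2e-3, 1e-3, 2e-3])
A  = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
G  = [np.diag([1.0, 0, 0, 0]), np.diag([0, 0, 0, 1.0])]   # hypothetical pinning of DG1 / DG4
v_leaders = [311.0 * 1.01, 311.0 * 0.99]                   # ±1% containment bounds around 311 V

print(containment_error(0, v, A, G, v_leaders), sharing_error(0, nQ, Q, A))
```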
Assumption A1. For any virtual leader, one or more paths exist that connect its dynamic behavior to every follower DG in the network.
2.4. Optimal Performance Metrics
Each DG $i$ optimizes its cost via a game, using a local performance index as in [25] to ensure proper power sharing, low energy consumption, and voltage security:
$$J_i = \sum_{k=0}^{\infty} \gamma^{k} \left[ e_i^{\mathrm T}(k) Q_{1i}\, e_i(k) + e_{Qi}^{\mathrm T}(k) Q_{2i}\, e_{Qi}(k) + u_i^{\mathrm T}(k) R_i\, u_i(k) \right], \tag{10}$$
where $e_i(k)$ and $e_{Qi}(k)$ are the containment voltage error and the reactive-power-sharing error, and $Q_{1i}$, $Q_{2i}$, $R_i$ are all positive weighting matrices. $\gamma$ denotes the discount factor, satisfying $0 < \gamma \le 1$. Each DG optimizes its control strategy locally through communication with neighbors to achieve system objectives.
Definition 1 ([25]). The control action is considered admissible if it stabilizes Equations (7) and (9) and ensures that the performance index $J_i$ remains bounded.
For any admissible control policy $u_i$, the local performance function $V_i$ of the $i$th DG can be expressed as $V_i(k) = r_i(k) + \gamma V_i(k+1)$ by applying the Bellman optimality principle, where $r_i(k) = e_i^{\mathrm T}(k) Q_{1i}\, e_i(k) + e_{Qi}^{\mathrm T}(k) Q_{2i}\, e_{Qi}(k) + u_i^{\mathrm T}(k) R_i\, u_i(k)$ is the stage cost. Specifically, the optimal local performance function is given by:
$$V_i^{*}(k) = \min_{u_i(k)} \left[ r_i(k) + \gamma V_i^{*}(k+1) \right], \tag{11}$$
where $V_i^{*}$ denotes the optimal value function, subject to the boundary condition $V_i^{*}(0) = 0$. Equation (11) represents the HJB equation. Accordingly, $u_i^{*}$ represents the local containment control input that achieves optimality for the microgrid, and its derivation is provided below:
$$u_i^{*}(k) = \arg\min_{u_i(k)} \left[ r_i(k) + \gamma V_i^{*}(k+1) \right]. \tag{12}$$
Remark 1. While conventional model-based methods can theoretically compute the optimal control by solving the HJB Equation (12), their practical application is hindered by a significant limitation: the reliance on precise microgrid parameters that are often unavailable or uncertain in real-world scenarios. To address this fundamental challenge, this paper proposes a model-free reinforcement learning approach. Instead of requiring an explicit system model, this method approximates the HJB solution and derives the optimal policy directly from input–output data using an actor–critic framework. This data-driven nature represents a key advantage, enhancing the controller’s robustness and practical value compared to conventional methods that depend on an idealized and often inaccurate system model.
3. Coordinated Voltage and Reactive Power Control Scheme Design Based on DLE and RL Algorithms
This section describes the coordinated control scheme for voltage and reactive power based on DLE and RL algorithms.
Figure 2 shows the overall control procedure.
3.1. Convergence Analysis of Policy Iteration
The iterative learning algorithm is applied to the containment controller as an optimization method using historical data. Each DG exchanges voltage and reactive power information with others via the communication network. A time sequence $\{kT_s\}$ with interval $T_s$ is defined. In policy iteration, the performance function $V_i^{(s)}$ is evaluated for a feasible policy $u_i^{(s)}$ and, as $s$ increases, both the performance function and the control policy are iteratively updated.
Step 1: Initialize an admissible control policy $u_i^{(0)}$ and set $s = 0$;
Step 2: Update the performance function by policy evaluation: $V_i^{(s)}(k) = r_i(k)\big|_{u_i = u_i^{(s)}} + \gamma V_i^{(s)}(k+1)$;
Step 3: Update the control actions by policy improvement: $u_i^{(s+1)}(k) = \arg\min_{u_i(k)} \left[ r_i(k) + \gamma V_i^{(s)}(k+1) \right]$;
Step 4: The algorithm terminates when $\| V_i^{(s+1)} - V_i^{(s)} \| \le \epsilon$, where $\epsilon$ is a predefined positive constant; otherwise, the iteration index $s$ is updated to $s + 1$, and the process returns to Step 2 for further iteration.
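As an illustration of the evaluate–improve–terminate loop in Steps 1–4, the sketch below runs policy iteration on a simplified discounted linear-quadratic problem (containment error state only, hypothetical weights and sampling interval); it is not the model-free implementation of Section 3.4, since policy evaluation here uses the assumed $A_d$, $B_d$.

```python
import numpy as np

# Hypothetical discrete-time error dynamics z(k+1) = Ad z(k) + Bd u(k) and weights
Ts = 1e-4
Ad = np.array([[1.0, Ts], [0.0, 1.0]])
Bd = np.array([[Ts**2 / 2], [Ts]])
Q, R, gamma = np.eye(2), np.array([[0.01]]), 0.95

K = np.zeros((1, 2))   # Step 1: initial policy u = -K z (admissible, since the discounted evaluation converges)
for s in range(50):
    # Step 2 (policy evaluation): P solves P = Q + K'RK + gamma*(Ad-Bd K)' P (Ad-Bd K)
    Acl = Ad - Bd @ K
    P = np.eye(2)
    for _ in range(2000):
        P = Q + K.T @ R @ K + gamma * Acl.T @ P @ Acl
    # Step 3 (policy improvement): K <- gamma*(R + gamma*Bd'PBd)^{-1} Bd'PAd
    K_new = gamma * np.linalg.solve(R + gamma * Bd.T @ P @ Bd, Bd.T @ P @ Ad)
    # Step 4: terminate once the policy change is below a small threshold
    if np.linalg.norm(K_new - K) < 1e-8 * (1.0 + np.linalg.norm(K)):
        K = K_new
        break
    K = K_new
print("iterations:", s + 1, "converged gain K =", K)
```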
The objective is to guarantee convergence of both the control strategy and the local performance function to their respective optimal values. To establish $V_i^{(s)} \to V_i^{*}$ and $u_i^{(s)} \to u_i^{*}$ as $s \to \infty$, an essential lemma is presented below.
Lemma 1 ([26]). Starting from any initial admissible control policy, $V_i^{(s)}$ and $u_i^{(s)}$ are updated iteratively via Steps 2 and 3. It can be shown that $V_i^{(s)}$ is monotonically nonincreasing, i.e., $V_i^{(s+1)} \le V_i^{(s)}$.
Theorem 1. Let $V_i^{(s)}$ and $u_i^{(s)}$ be generated by Step 2 and Step 3. As $s \to \infty$, $V_i^{(s)}$ converges to the optimal value $V_i^{*}$, and $u_i^{(s)}$ converges to the optimum $u_i^{*}$, i.e.,
$$\lim_{s \to \infty} V_i^{(s)} = V_i^{*}, \qquad \lim_{s \to \infty} u_i^{(s)} = u_i^{*}. \tag{13}$$
Proof. Let $V_i^{(s)}$ denote the value function at iteration $s$, and define its pointwise limit as $V_i^{(\infty)} = \lim_{s \to \infty} V_i^{(s)}$. By Step 2 and Step 3, for all $s$, the recursion in (14) holds.
First, we note that for any $\epsilon > 0$, there exists an integer $s_0$ such that for all $s \ge s_0$, inequality (15) holds. For any admissible control policy, as $s \to \infty$, inequality (16) follows. Combining inequalities (15) and (16), for any $\epsilon > 0$, we obtain (17). Since $\epsilon$ is arbitrary, by letting $\epsilon \to 0$, we conclude that (18) holds.
For any admissible control policy, a new performance index can be used to equivalently express the problem, as given in (19). Furthermore, assume there exists a state for which the limit function $V_i^{(\infty)}$ exceeds the optimal value $V_i^{*}$. By recursively unfolding the definition of the performance index, for a finite horizon $N$ (and noting that the terminal cost vanishes as $N \to \infty$), we obtain (21). According to the definition of $V_i^{*}$, (22) holds. By the principle of optimality, $V_i^{*}$ is the minimal cost. Therefore, (23) holds, which contradicts our previous assumption. Thus, it must hold that $V_i^{(\infty)}(k) \le V_i^{*}(k)$ for all $k$. Moreover, $V_i^{*}$ serves as a global lower bound on the cost for any admissible policy, with equality attained under the optimal policy, i.e., (24).
Similarly, this bound holds for any iteration $s$. Taking the limit as $s \to \infty$, we have $V_i^{(\infty)} \le V_i^{*}$. On the other hand, by the definition of $V_i^{*}$ as the minimal cost achievable by any admissible policy, $V_i^{(\infty)}$ cannot be smaller than $V_i^{*}$. Therefore, the following equality holds: $V_i^{(\infty)} = V_i^{*}$.
This completes the proof. □
This algorithm ensures voltage convergence to the optimal values under containment control and achieves accurate reactive power sharing.
3.2. Stability Analysis of Coordinated Voltage and Reactive Power Control
Section 3.1 proved that the policy iteration algorithm converges to the optimal control policy $u_i^{*}$. This section demonstrates that the application of this optimal policy ensures the asymptotic stability of the closed-loop system. Specifically, we prove that the containment voltage error $e_i(k)$ and the reactive-power-sharing error $e_{Qi}(k)$ converge to zero.
To analyze the stability, we employ a Lyapunov-based approach. The optimal value function derived from the Bellman equation serves as a natural candidate for a Lyapunov function for the closed-loop error dynamics of the ith DG.
Theorem 2. For the error dynamics described by (7) and (9), if the control policy is obtained from the converged policy iteration algorithm as described in Section 3.1, then the closed-loop system is asymptotically stable at the origin, i.e., $\lim_{k \to \infty} e_i(k) = 0$ and $\lim_{k \to \infty} e_{Qi}(k) = 0$.
Proof. The optimal value function $V_i^{*}$ satisfies the Bellman optimality equation for the optimal policy $u_i^{*}$:
$$V_i^{*}(k) = \sum_{n=k}^{\infty} \gamma^{\,n-k}\, r_i(n), \tag{25}$$
where $r_i(n) = e_i^{\mathrm T}(n) Q_{1i}\, e_i(n) + e_{Qi}^{\mathrm T}(n) Q_{2i}\, e_{Qi}(n) + u_i^{*\mathrm T}(n) R_i\, u_i^{*}(n)$ is the stage cost under $u_i^{*}$.
According to Definition 1, for any admissible control policy, the performance index $J_i$ must be bounded. The optimal policy $u_i^{*}$ is, by definition, an admissible policy. Therefore, the optimal value function $V_i^{*}$, which is the minimum possible value of $J_i$, must be finite.
From the definition of $r_i(k)$, since the weighting matrices $Q_{1i}$, $Q_{2i}$, and $R_i$ are all positive definite, $r_i(k) \ge 0$ for all $k$. The equality holds if and only if $e_i(k) = 0$, $e_{Qi}(k) = 0$, and $u_i^{*}(k) = 0$.
For the infinite series in (25) to converge to a finite value with a discount factor $\gamma$, it is a necessary condition that the terms of the series approach zero, that is:
$$\lim_{n \to \infty} \gamma^{\,n-k}\, r_i(n) = 0.$$
Since $\gamma$ is a positive constant, this implies:
$$\lim_{n \to \infty} r_i(n) = 0.$$
Given that $r_i(n)$ is a sum of non-negative terms, for their sum to be zero, each individual term must be zero. Therefore, we must have:
$$\lim_{n \to \infty} e_i^{\mathrm T}(n) Q_{1i}\, e_i(n) = 0, \qquad \lim_{n \to \infty} e_{Qi}^{\mathrm T}(n) Q_{2i}\, e_{Qi}(n) = 0.$$
Since $Q_{1i}$ and $Q_{2i}$ are positive definite matrices, this directly leads to the conclusion that the error states converge to zero:
$$\lim_{n \to \infty} e_i(n) = 0, \qquad \lim_{n \to \infty} e_{Qi}(n) = 0.$$
This demonstrates that the origin of the error system is asymptotically stable under the optimal control policy $u_i^{*}$. □
3.3. Dynamic Leader Election Algorithm
Containment control maintains voltage safety in microgrids by enforcing upper and lower bounds. However, conventional approaches usually predefine the upper-bound leader. If this leader experiences a heavy load, its voltage may decrease, violating the highest-voltage assumption and impairing effective reactive power sharing.
To dynamically adjust the containment control leader based on bus voltage, each DG must access the voltage of all DG-connected buses. However, due to the distributed communication architecture in microgrids, non-adjacent DGs cannot directly share information. Thus, a bus voltage estimation algorithm is required to enable indirect acquisition of voltage data among non-adjacent buses.
Let $\tilde{v}_i = [\tilde{v}_{i1}, \tilde{v}_{i2}, \ldots, \tilde{v}_{iN}]^{\mathrm T}$ denote the vector of bus voltage estimates by the $i$th DG, where $\tilde{v}_{ij}$ represents the $i$th DG’s estimate of the voltage at the bus to which the $j$th DG is connected; furthermore, $\tilde{v}_{ii} = v_i$. Then, the update rule for the estimated value $\tilde{v}_{ij}$ takes the form
$$\dot{\tilde{v}}_{ij} = \sum_{k \in N_i} a_{ik}\left(\tilde{v}_{kj} - \tilde{v}_{ij}\right) + a_{ij}\left(v_j - \tilde{v}_{ij}\right), \tag{27}$$
where $a_{ik}$ specifies the $(i,k)$ entry in the adjacency matrix. The first term, $\sum_{k \in N_i} a_{ik}(\tilde{v}_{kj} - \tilde{v}_{ij})$, represents the difference between the $i$th DG’s estimate and its neighboring $k$th DG’s estimate of the bus voltage at the $j$th DG. The second term, $a_{ij}(v_j - \tilde{v}_{ij})$, captures the error between the $i$th DG’s estimate and the actual bus voltage at the $j$th DG.
During each iteration of the microgrid controller, each DG estimates the voltages at all buses according to (27). Based on the estimated vector $\tilde{v}_i$, the $i$th DG determines whether it is elected as an upper- or lower-bound leader. The specific rules are given as follows:
If $\tilde{v}_{ii}$ is the maximum value in $\tilde{v}_i$, then node $i$ is selected as the upper-bound leader.
If $\tilde{v}_{ii}$ is the minimum value in $\tilde{v}_i$, then node $i$ is selected as the lower-bound leader.
If $\tilde{v}_{ii}$ is neither the maximum nor the minimum value in $\tilde{v}_i$, node $i$ is not selected as a leader.
If there are multiple nodes whose estimated values are equal to $\tilde{v}_{ii}$ and these values are the maximum or minimum in $\tilde{v}_i$, node $i$ is elected as the upper- or lower-bound leader only if its index $i$ is the smallest; otherwise, it is not selected as a leader.
Other nodes follow the same procedure to determine their leadership status. In the following, we prove the convergence of the bus voltage estimation algorithm.
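Before the proof, a minimal sketch of the estimator (27) (integrated with forward Euler) and the election rules is given below; the communication topology, bus voltages, step size, and iteration count are all hypothetical.

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)            # hypothetical communication adjacency matrix
v = np.array([310.2, 308.9, 311.6, 309.8])     # hypothetical actual bus voltages (V)
N, dt = len(v), 0.05                           # forward-Euler step for the estimator ODE (27)

v_est = np.zeros((N, N))                       # v_est[i, j]: DG i's estimate of DG j's bus voltage
v_est[np.arange(N), np.arange(N)] = v          # each DG knows its own bus voltage exactly

for _ in range(2000):                          # run the distributed estimator until (near) convergence
    dv = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            consensus  = sum(A[i, k] * (v_est[k, j] - v_est[i, j]) for k in range(N))
            correction = A[i, j] * (v[j] - v_est[i, j])
            dv[i, j] = consensus + correction
    v_est += dt * dv
    v_est[np.arange(N), np.arange(N)] = v      # re-impose the constraint v_est[i, i] = v_i

def elect(i, est_row):
    """Election rules evaluated locally by DG i from its own estimate vector (ties -> smallest index)."""
    if est_row[i] == est_row.max() and i == int(np.flatnonzero(est_row == est_row.max())[0]):
        return "upper-bound leader"
    if est_row[i] == est_row.min() and i == int(np.flatnonzero(est_row == est_row.min())[0]):
        return "lower-bound leader"
    return "follower"

for i in range(N):
    print(f"DG{i + 1}: {elect(i, v_est[i])}")
```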
Theorem 3. Consider the estimator dynamic for bus voltage in a microgrid, given by (27). Under Assumption 1, it holds that $\tilde{v}_{ij} \to v_j$ for all $i, j$.
Proof. Introduce the error variable $\varepsilon_{ij}$, defined by $\varepsilon_{ij} = \tilde{v}_{ij} - v_j$, where $\varepsilon_{ij}$ represents the estimation error of the bus voltage at the $j$th DG, as estimated by the $i$th DG. Since $v_j$ is constant or varies very slowly in steady state, it follows that $\dot{\varepsilon}_{ij} = \dot{\tilde{v}}_{ij}$.
Substituting the system dynamics yields:
$$\dot{\varepsilon}_{ij} = \sum_{k \in N_i} a_{ik}\left(\tilde{v}_{kj} - \tilde{v}_{ij}\right) + a_{ij}\left(v_j - \tilde{v}_{ij}\right). \tag{28}$$
By substituting the relation $\tilde{v}_{ij} = \varepsilon_{ij} + v_j$ into the above equation, the error dynamics can be expressed as
$$\dot{\varepsilon}_{ij} = \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right) - a_{ij}\,\varepsilon_{ij}. \tag{29}$$
Next, define the Lyapunov function as
$$W = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij}^{2}, \tag{30}$$
which satisfies $W \ge 0$ for all $t$, and $W$ equals zero only when $\varepsilon_{ij}$ is zero for all $i, j$.
By differentiating (30) with respect to time, we obtain
$$\dot{W} = \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij}\,\dot{\varepsilon}_{ij} = \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij} \left[ \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right) - a_{ij}\,\varepsilon_{ij} \right]. \tag{31}$$
To facilitate further analysis, we separate the right-hand side of (31) into two components and define
$$W_1 = \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij} \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right), \qquad W_2 = -\sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij}\,\varepsilon_{ij}^{2}. \tag{32}$$
Noting that $a_{ik} = a_{ki}$, we exchange the indices $i$ and $k$ in $W_1$ to obtain an equivalent form. Summing the two expressions yields
$$W_1 = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right)^{2}, \tag{33}$$
which implies that
$$\dot{W} = W_1 + W_2 = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right)^{2} - \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij}\,\varepsilon_{ij}^{2} \le 0.$$
Since every term in the above sums is nonnegative and $\dot{W} \le 0$, we also have $W(t) \le W(0)$. Therefore, $W$ is nonincreasing and bounded below, which shows that $\dot{W} \to 0$ as $t \to \infty$, and consequently, $\varepsilon_{ij} \to 0$. Therefore, the $i$th DG’s estimation error for the $j$th DG’s bus voltage gradually converges to zero.
This completes the proof. □
Remark 2. In direct contrast to conventional containment control methods based on fixed leaders, as proposed in [10,11], this work addresses their well-known limitation in handling complex scenarios like sudden large load changes. The fixed-leader approach often fails to ensure accurate reactive power sharing under such conditions. To overcome this specific flaw, this paper introduces a novel DLE algorithm. This mechanism, based on bus voltage estimation, allows each DG to dynamically select the leader according to real-time operating conditions. By doing so, it enables accurate reactive power sharing precisely where the conventional method falters, providing a clear, practical demonstration of its superiority over the static, fixed-leader approach.
3.4. RL-Based Containment Control Implementation
To ensure voltage containment and accurate reactive power sharing, this section proposes a control method based on actor–critic reinforcement learning. The actor network generates the control policy, while the critic network evaluates and guides its optimization. Through online iteration, the algorithm converges to the optimal control. The implementation structure is shown in
Figure 2.
3.4.1. Critic Network
The critic network is designed to approximate the optimal value function, expressed as $\hat{V}_i(k)$. This network adopts a three-layer back-propagation neural architecture. Define the input vector of the critic as $X_i(k) = [\,e_i^{\mathrm T}(k) \;\; e_{Qi}^{\mathrm T}(k)\,]^{\mathrm T}$, let $l_c$ denote the quantity of neurons within the hidden layer, and let $W_{c1,i}$ and $W_{c2,i}$ represent the weight matrices for the hidden and output layers, respectively. Accordingly, the hidden layer is supplied with $W_{c1,i} X_i(k)$ as its input. A hyperbolic tangent activation function, $\tanh(\cdot)$, is employed in the hidden layer to capture smooth nonlinear relationships, with $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. The corresponding hidden layer output is $\phi_{c,i}(k) = \tanh\left(W_{c1,i} X_i(k)\right)$. Ultimately, the output of the critic network at time $k$ is represented by $\hat{V}_i(k) = W_{c2,i}^{\mathrm T}\, \phi_{c,i}(k)$.
The error term for the critic network is defined by $e_{c,i}(k) = \hat{V}_i(k) - \left[ r_i(k) + \gamma \hat{V}_i(k+1) \right]$. To train the critic network, gradient descent is employed to minimize the error $e_{c,i}(k)$, resulting in the following objective function: $E_{c,i}(k) = \tfrac{1}{2}\, e_{c,i}^{2}(k)$.
The iterative update laws for the weights $W_{c1,i}$ and $W_{c2,i}$ are given by
$$W_{c1,i}^{\,l+1} = W_{c1,i}^{\,l} - \alpha_c \frac{\partial E_{c,i}(k)}{\partial W_{c1,i}^{\,l}},$$
where $l$ is the neural network iteration index, and $\alpha_c$ represents the learning rate. For the output layer weights:
$$W_{c2,i}^{\,l+1} = W_{c2,i}^{\,l} - \alpha_c \frac{\partial E_{c,i}(k)}{\partial W_{c2,i}^{\,l}}.$$
3.4.2. Actor Network
The actor network is constructed to approximate the optimal control policy $u_i^{*}(k)$. It is implemented as a three-layer neural network, where the hidden layer consists of $l_a$ neurons and employs the hyperbolic tangent activation function. Denote $W_{a1,i}$ and $W_{a2,i}$ as the weight matrices for the hidden and output layers, respectively. The hidden layer output can be expressed as $\phi_{a,i}(k) = \tanh\left(W_{a1,i} X_i(k)\right)$, where $X_i(k)$ is the same input vector as that of the critic. The final output of the network is given by $\hat{u}_i(k) = W_{a2,i}^{\mathrm T}\, \phi_{a,i}(k)$.
By continuously adjusting the parameters $W_{a1,i}$ and $W_{a2,i}$, the network aims to derive the optimal control input based on $X_i(k)$. The parameter update is guided by minimizing the objective function $E_{a,i}(k) = \tfrac{1}{2}\, e_{a,i}^{\mathrm T}(k)\, e_{a,i}(k)$, where the error term is defined by $e_{a,i}(k) = \hat{V}_i(k)$, i.e., the critic’s evaluation of the current action, whose desired value is zero.
The purpose of the optimization problem is to ensure that the actor network produces an optimal control action that minimizes the value function. Similar to the critic network, the weights $W_{a1,i}$ and $W_{a2,i}$ are updated using gradient descent:
$$W_{a1,i}^{\,l+1} = W_{a1,i}^{\,l} - \alpha_a \frac{\partial E_{a,i}(k)}{\partial W_{a1,i}^{\,l}}.$$
Similarly, the weights of the output layer are updated by
$$W_{a2,i}^{\,l+1} = W_{a2,i}^{\,l} - \alpha_a \frac{\partial E_{a,i}(k)}{\partial W_{a2,i}^{\,l}},$$
where $\alpha_a$ represents the learning rate of the actor network.
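A minimal sketch of one critic/actor adaptation step is given below. For the actor update to be computable without a system model, the sketch feeds the action into the critic (an action-dependent critic), which is a simplification of the scheme described above; the network sizes, learning rates, cost weights, and sampled data are hypothetical, and gradients are taken numerically instead of by back-propagation for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u, n_h, gamma = 4, 1, 5, 0.95             # error-state dim, control dim, hidden neurons, discount
alpha_c, alpha_a = 1e-2, 1e-3                    # hypothetical learning rates

# Action-dependent critic Q(x,u) = Wc2^T tanh(Wc1 [x;u]); actor u(x) = Wa2^T tanh(Wa1 x)
Wc1, Wc2 = rng.normal(0, 0.1, (n_h, n_x + n_u)), rng.normal(0, 0.1, (n_h, 1))
Wa1, Wa2 = rng.normal(0, 0.1, (n_h, n_x)), rng.normal(0, 0.1, (n_h, n_u))

Qfun = lambda x, u, W1, W2: (W2.T @ np.tanh(W1 @ np.concatenate([x, u]))).item()
pi   = lambda x, W1, W2: (W2.T @ np.tanh(W1 @ x)).ravel()

def stage_cost(x, u):
    # r(k) = e^T Q1 e + e_Q^T Q2 e_Q + u^T R u, with assumed identity / 0.01 weights
    return float(x @ x + 0.01 * (u @ u))

def num_grad(f, W, eps=1e-6):
    """Numerical gradient of the scalar objective f with respect to weight matrix W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps; Wm[idx] -= eps
        g[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

# One online adaptation step from measured data (x_k, u_k, x_k1); the samples are hypothetical
x_k  = rng.normal(0.0, 1.0, n_x)
u_k  = pi(x_k, Wa1, Wa2)
x_k1 = 0.9 * x_k                                 # stands in for the next measured error state

# Critic step: gradient descent on E_c = 0.5*(Q(x_k,u_k) - [r_k + gamma*Q(x_k1,u_k1)])^2
u_k1   = pi(x_k1, Wa1, Wa2)
target = stage_cost(x_k, u_k) + gamma * Qfun(x_k1, u_k1, Wc1, Wc2)
Ec = lambda W1, W2: 0.5 * (Qfun(x_k, u_k, W1, W2) - target) ** 2
Wc1 -= alpha_c * num_grad(lambda W: Ec(W, Wc2), Wc1)
Wc2 -= alpha_c * num_grad(lambda W: Ec(Wc1, W), Wc2)

# Actor step: gradient descent on E_a = 0.5*Q(x_k, u(x_k))^2, pushing the evaluated cost toward zero
Ea = lambda W1, W2: 0.5 * Qfun(x_k, pi(x_k, W1, W2), Wc1, Wc2) ** 2
Wa1 -= alpha_a * num_grad(lambda W: Ea(W, Wa2), Wa1)
Wa2 -= alpha_a * num_grad(lambda W: Ea(Wa1, W), Wa2)
```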
Remark 3. As typical nonlinear systems, island microgrids present challenges for model-based controller design due to difficulties in obtaining practical parameters such as resistance and inductance [17,18]. These methods are also sensitive to measurement errors, further complicating controller design. In contrast to [18], this paper proposes a data-driven online reinforcement learning approach that does not require extensive offline data processing. The control policy is iteratively optimized by minimizing the value function, enabling accurate reactive power sharing and effective voltage control in the microgrid.
Remark 4. From a practical standpoint, the proposed DLE and model-free RL framework is designed to address key operational challenges in real-world offshore microgrids. First, the DLE algorithm provides crucial operational flexibility and resilience. It enables the microgrid to autonomously adapt to the harsh and dynamic conditions of offshore environments (e.g., sudden load changes, volatile renewables), overcoming the rigidity of fixed-leader schemes to ensure stability without manual intervention. Second, the model-free RL controller eliminates the reliance on an accurate system model, which is a significant practical challenge because obtaining line parameters is often both difficult and costly. By learning directly from measurement data, our approach simplifies deployment, reduces commissioning costs, and enhances robustness against parameter uncertainties and system aging. Collectively, these features make the proposed framework not only technically effective but also practical, cost-efficient, and resilient for real-world offshore applications.
4. Simulation Results
As shown in Figure 3, the offshore island AC microgrid system consisted of four renewable generation units. The algorithm proposed in this paper was validated on a simulation model built on the Simulink platform. The validation strategy across the following cases was deliberately designed to highlight the practical value of the proposed model-free approach. Instead of a direct numerical comparison against a model-based controller, which can be misleading (as its performance is entirely dependent on an idealized, perfectly accurate model that is unavailable in reality), our validation focused on two key aspects. First, in Case 4.1, we conducted a head-to-head comparison with a conventional fixed-leader method [10] to demonstrate that our DLE algorithm solved a fundamental operational flaw. Second, in Cases 4.2 and 4.3, we verified that our model-free controller robustly achieved all control objectives under challenging conditions (load changes and plug-and-play), thereby proving its effectiveness and practical viability on its own terms. Following the approach in [10], the allowable voltage deviation was set to ±1%, and the rated bus voltage $V_n$ was selected as 311 V, which served as the system design objective. Specific simulation parameters are provided in Table 1, and other related parameters were as follows: the discount factor $\gamma$ was selected as described in Section 2.4; in the performance index, both $Q_{1i}$ and $Q_{2i}$ were diagonal matrices with diagonal elements equal to 1, and the control weighting matrix $R_i$ was set accordingly. Both the actor and critic networks employed five hidden neurons.
4.1. Dynamic Leader Election
This case was designed to validate the effectiveness of the proposed dynamic leader election (DLE) algorithm. To achieve this, we first established a benchmark scenario over the initial time interval by implementing the fixed-leader containment control described in [10]. The purpose of this benchmark was to replicate a well-known limitation of conventional methods. As shown in Figure 4, during this benchmark period, while the system successfully maintained voltage containment (i.e., all bus voltages remained within the safe range), the reactive-power-sharing ratio failed to achieve the desired value. This outcome was an expected consequence of the fixed-leader topology: since DG1 was designated as the upper-limit leader, its bus voltage was consistently maintained at the highest level, which inherently restricted reactive power flow and prevented equitable sharing among the DGs. This scenario effectively highlighted the specific problem that the data-driven model-free approach was designed to overcome without relying on pre-configured roles or precise system parameters. Subsequently, at the first load-change instant, the load distribution was changed by transferring the load on Bus 3 to Bus 2. With the leaders still fixed as DG1 and DG4, it can be observed from Figure 4 that although the microgrid remained effective in voltage containment control, reactive power sharing was still not fully achieved.
The proposed dynamic leader election algorithm was then enabled. As shown in Figure 4, the dynamic leader election algorithm allowed the upper-limit leader to be automatically elected as DG4 and the lower-limit leader to be automatically elected as DG1. Under the effect of the dynamic leader election algorithm, the microgrid not only achieved voltage containment control but also brought the reactive-power-sharing ratio to the desired value, successfully achieving precise reactive power sharing. Subsequently, the load on Bus 2 was transferred back to Bus 3. From Figure 4, it can be observed that after a brief transient process, the microgrid once again achieved voltage containment control, and the reactive-power-sharing ratio was restored to the desired value, ensuring precise reactive power sharing. The results confirm that the proposed algorithm adaptively selects leaders based on bus voltage magnitude.
4.2. Load Variation
Through simulation and comparative experiments, this study verified the effectiveness of the proposed dynamic leader election algorithm and model-free reinforcement learning algorithm. The proposed approach achieved both voltage recovery and accurate reactive power sharing. First, during the initial time interval, the microgrid employed only the conventional PI control strategy described in [7], as shown in Figure 5. During that phase, the loads on Bus 1 and Bus 3 of the microgrid were kept at their initial values. It can be seen that the microgrid voltage was not restored to the safe level, nor was the reactive-power-sharing ratio precisely maintained at the desired value.
The proposed dynamic leader election and model-free reinforcement learning algorithms were then enabled. As shown in Figure 5, the voltage quickly recovered to within the safe constraint range, and the reactive-power-sharing ratio also reached the desired value, achieving accurate reactive power sharing. To further validate the robustness of the algorithms under load variation, the load on Bus 3 (DG3) was subsequently increased. From Figure 5, it can be observed that after experiencing a brief transient process, the microgrid voltage returned to the steady state and remained within the safe range. Simultaneously, the reactive-power-sharing ratio again reached the desired value, achieving precise reactive power sharing.
Figure 6 shows the evolution of actor–critic neural network weights for DG1 during the simulation. All weights converged to stable values, as illustrated.
4.3. Plug-And-Play Capability
The plug-and-play capability of microgrids enables the rapid integration or removal of DGs, allowing the system to adapt to load changes and equipment failures, thereby improving overall flexibility and scalability. To comprehensively and realistically validate the plug-and-play performance of the proposed algorithm, this section designs an experiment that includes a “plug-out” event and a “plug-in” process that mimics real-world engineering scenarios.
The simulation results are shown in Figure 7. During the initial period, the microgrid operated stably with all four DGs, and the proposed algorithms achieved voltage containment control and accurate reactive power sharing at the desired ratio. DG4 was then disconnected (plug-out) to simulate its removal from operation. The simulation results show that after DG4 was removed, the power deficit was automatically compensated by the remaining DGs. The system voltage, after a brief transient, quickly stabilized and remained within the safe constraint range. Meanwhile, reactive power was redistributed among the three remaining DGs, reaching a new stable sharing ratio. To simulate the reconnection process of a DG, DG4 subsequently initiated the synchronization process with the microgrid. During that phase, DG4 adjusted its output voltage frequency, phase, and amplitude to match the microgrid’s parameters in preparation for grid connection. Upon successful synchronization, DG4 was physically connected to the microgrid, and its controller was activated. As observed in Figure 7, the system seamlessly reintegrated DG4. The voltage remained stable, and the reactive-power-sharing ratio, after a short dynamic adjustment, was accurately restored to its initial state.
This complete test, encompassing both a plug-out event and a realistic plug-in process, robustly demonstrates that the proposed dynamic leader election and model-free reinforcement learning algorithms provide the microgrid with plug-and-play capability, ensuring safe and stable operation under dynamic topological changes.
5. Conclusions
This paper developed a secondary control method for offshore island microgrids based on a model-free reinforcement learning algorithm and a dynamic leader election mechanism. First, by combining the microgrid’s voltage containment error and reactive power sharing error, a value function for policy iteration was constructed. Then, a dynamic leader election algorithm was designed, enabling different DGs to be dynamically elected as leaders to facilitate accurate reactive power allocation. Subsequently, a model-free reinforcement learning algorithm was developed, which relied solely on real-time measurements of voltage and reactive power without requiring a complex system model.
However, it is important to acknowledge that this study was conducted under the assumption of an ideal island microgrid model, where factors such as communication delays, external disturbances, and potential cyber-attacks were not considered. Communication delays, which are inherent in distributed control systems, could introduce time lags in the information exchange among DGs. This might affect the timeliness of the dynamic leader election process and degrade the performance of the reinforcement learning algorithm, potentially leading to oscillations or even instability. Similarly, other disturbances, such as measurement noise and unmodeled dynamics, could impact the accuracy of the data-driven RL algorithm, which is highly dependent on the quality of measurement data. Addressing these practical challenges is crucial for real-world implementation. Therefore, these aspects will be the focus of our future work. We plan to investigate and develop more robust control strategies that can tolerate communication delays and are resilient to various disturbances. This may involve integrating predictive control mechanisms or designing delay-compensation techniques within the RL framework. To validate the effectiveness and robustness of the enhanced methods, we intend to conduct more comprehensive hardware-in-the-loop simulations or tests on a physical experimental platform.