1. Introduction
Renewable generation is playing an increasingly important role in the development and utilization of marine resources [1,2]. Specifically, the proportion of renewable generation units in offshore island microgrids is higher than that in land-based microgrids. These renewable generation units are integrated into the grid via inverters that convert direct current (DC) into alternating current (AC) [3]. Voltage serves as a crucial parameter in island AC microgrid operation [4]. Proper reactive power allocation affects not only power balance but also voltage regulation and line loss optimization. Since island microgrids are not connected to the main grid, voltage and reactive power regulation depend entirely on internal coordinated control [5]. Additionally, island microgrids operate under harsh marine climatic conditions, which pose significant challenges to their operational security. Given their important role in promoting the development of marine resources, voltage regulation and reactive power sharing have become key issues that require extensive research [6].
Voltage regulation and reactive power sharing have been extensively investigated in previous research. The average voltage regulation method in [7] ensures accurate reactive power sharing but may violate voltage safety constraints under load fluctuations. The weighted coefficient method in [8] maintains voltage within safe limits but relies on empirical tuning, limiting scalability. The optimization-based approach in [9] achieves precise control but is computationally intensive and hard to implement in real time. These limitations hinder the practical deployment of these methods in island microgrids.
To address the conflict between voltage regulation and reactive power sharing in island AC microgrids, containment control has gradually attracted increasing attention. In [10], the use of containment control was first proposed to balance voltage regulation and reactive power sharing in microgrids, laying a theoretical foundation for subsequent research in this field. The application of containment control was further extended in [11] to address voltage regulation issues among multiple interconnected microgrids, which significantly improved the performance and stability of coordinated multi-microgrid operation. In [12], the discussion focuses on a containment control-based strategy for voltage regulation in microgrids facing communication and sensor failures. However, in the above methods, the leaders for containment control are typically predetermined. When the bus associated with the upper-bound leader experiences a heavy load, its voltage may drop significantly, which contradicts the assumption that the upper leader should always maintain the highest voltage. This situation leads to ineffective reactive power transfer among buses. The fixed leader configuration lacks flexibility and reduces the adaptability of the island microgrid under dynamic operating conditions.
In island microgrids that have already been put into operation, it is difficult to accurately obtain model parameters such as resistance and inductance due to the limited precision of measurement devices and the influence of operating conditions (such as temperature variations). In contrast, data-driven approaches that do not rely on explicit model parameters can directly utilize operational data from the microgrid for controller design. In [13], a deep learning-based secondary controller for microgrids is designed using historical data. In [14], Koopman operator theory enables voltage control based on input–output data. In [15], a data-driven distributed predictive control method achieves voltage restoration and current sharing via an incremental linear model. In [16], least squares and Gaussian process regression are used to learn system sensitivity and estimate modeling errors, ensuring optimal and safe microgrid control. However, the method in [13] requires processing a large quantity of offline data, while the approach in [15] has high computational complexity and slow convergence of control performance, making it difficult to satisfy the real-time requirements of microgrid control. The methods in [14,16] rely heavily on prior knowledge and historical data. These issues limit the widespread application of data-driven methods in practical island microgrids.
Based on the above discussion, this paper proposes an island microgrid voltage regulation and reactive-power-sharing control strategy that combines a dynamic leader election algorithm with a model-free reinforcement learning algorithm. The proposed strategy utilizes a leader election algorithm to dynamically adjust the leader roles according to the load conditions of each bus in the island microgrid: the distributed generation (DG) corresponding to the bus with a higher load is set as the lower-bound leader, while the DG associated with the bus with a lighter load is set as the upper-bound leader. In this way, flexible reactive power flow among buses is promoted, enabling precise reactive power sharing. Meanwhile, by designing value functions for voltage and reactive power errors and employing a model-free reinforcement learning algorithm, the controller is designed based solely on island microgrid operational data without requiring any model information. Furthermore, this paper theoretically proves the convergence of the leader election algorithm and the optimality of the policy iteration algorithm in model-free reinforcement learning. Lastly, the proposed strategy is validated through a series of simulation experiments. In the experiments, the effectiveness of the proposed method is validated in three distinct case studies, which confirm that it restores the voltage of island AC microgrids to the reference range set by containment control and accomplishes accurate reactive power sharing. The main contributions of this research are summarized below:
To address the limitations of containment control methods based on fixed leaders, as proposed in [10,11], which struggle with complex scenarios like sudden large load changes, this paper introduces a novel dynamic leader election (DLE) algorithm. Unlike the static nature of fixed-leader approaches, our DLE mechanism is based on bus voltage estimation, allowing each DG to dynamically select the leader according to the relative magnitude of the estimated voltages. This adaptive capability enables accurate reactive power sharing even under sudden load changes or large load fluctuations, significantly enhancing the microgrid’s flexibility.
To overcome the challenges of model-based controller design, as highlighted in [17,18], where obtaining practical parameters like resistance and inductance is difficult, this paper proposes a data-driven online reinforcement learning approach. In contrast to model-based methods that are sensitive to parameter uncertainties and measurement errors, our algorithm does not require extensive offline data processing. The control policy is iteratively optimized online by minimizing a value function, enabling accurate reactive power sharing and effective voltage control in the microgrid without relying on a precise system model.
The structure of this paper is as follows: The necessary background and preliminaries are outlined in Section 2. Section 3 proves the convergence of policy iteration, proposes a dynamic leader election algorithm, and designs a model-free reinforcement learning algorithm. Section 4 validates the effectiveness of the proposed methods through numerical experiments. Finally, Section 5 summarizes the main contributions of this paper.
2. Preliminaries and Problem Formulation
This section presents the graph theory, island microgrid modeling, analysis of reactive power and voltage coupling, and the formulation of performance indices for containment control.
2.1. Graph Theory
This paper considers an island AC microgrid modeled as a multi-agent system (MAS), comprising N follower agents and M virtual leader agents. The MAS is represented by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, 2, \ldots, N\}$ denotes the node set, while $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ corresponds to the collection of edges. Two nodes $i$ and $j$ are considered neighbors if $(i, j) \in \mathcal{E}$. For the graph $\mathcal{G}$, $A = [a_{ij}] \in \mathbb{R}^{N \times N}$ is the adjacency matrix, where $a_{ij} > 0$ if $(i, j) \in \mathcal{E}$; otherwise, $a_{ij} = 0$. The degree matrix is $D = \mathrm{diag}\{d_1, \ldots, d_N\}$, where $d_i = \sum_{j \in N_i} a_{ij}$ is the degree of the $i$th node in the graph, and $N_i$ represents the set of neighbors of node $i$. The Laplacian matrix $L$ is then defined by $L = D - A$. Matrix $G_r = \mathrm{diag}\{g_{1r}, \ldots, g_{Nr}\}$ ($r = 1, \ldots, M$) is the pinning gain matrix associated with the $r$th virtual leader, where $g_{ir} > 0$ if the $r$th virtual leader can communicate with the $i$th follower; otherwise, $g_{ir} = 0$.
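As a point of reference, the sketch below (Python/NumPy, not part of the original design) builds the Laplacian and pinning matrices for an assumed four-DG ring communication topology; the adjacency entries and the choice of pinned followers are hypothetical.

```python
import numpy as np

# Assumed 4-DG ring communication topology: a_ij > 0 iff DG i and DG j exchange data
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix: d_i = sum_j a_ij
L = D - A                    # graph Laplacian L = D - A

# Pinning gain matrices G_r (hypothetical): the upper-bound leader (r = 1) pins DG1,
# the lower-bound leader (r = 2) pins DG4; g_ir > 0 only for pinned followers.
G1 = np.diag([1.0, 0.0, 0.0, 0.0])
G2 = np.diag([0.0, 0.0, 0.0, 1.0])

print(L)
```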
2.2. Model Descriptions of Island AC Microgrids
Assume that all DGs are interfaced with the island microgrid through voltage source inverters (VSIs), each equipped with an output LC filter, as illustrated in Figure 1. By applying feedback linearization [17], the following relationship is established:
$$\dot{y}_i = L_{F_i} h_i(x_i), \qquad \ddot{y}_i = L_{F_i}^{2} h_i(x_i) + L_{G_i} L_{F_i} h_i(x_i)\, u_{0i} \triangleq u_i, \tag{1}$$
where $u_{0i}$ is the original control input of the $i$th VSI. For the sake of brevity, while the complete derivation is detailed in [11], the key terms are briefly defined here. $x_i$ is the state vector of the $i$th DG, which includes filter currents and capacitor voltages in the dq-frame. $y_i = h_i(x_i)$ is the output function, defined as the bus voltage $v_{bi}$. $F_i$ and $G_i$ represent the drift and input vector fields of the nonlinear system, respectively. $L_{F_i} h_i$ and $L_{F_i}^{2} h_i$ are the first- and second-order Lie derivatives of the output function with respect to the system dynamics. $u_i$ is the new, linearized control input. According to [11], $y_i$ is used to denote the bus voltage $v_{bi}$. Since the control of microgrids is typically implemented within a digital control framework, it is necessary to discretize the aforementioned continuous-time equations. After discretization, the variable $y_i$ in Equation (1) can be reformulated as follows:
$$z_i(k+1) = A_d\, z_i(k) + B_d\, u_i(k), \tag{2}$$
where the new control input is $u_i(k)$. The state variable is defined by $z_i(k) = [\,y_i(k) \;\; \dot{y}_i(k)\,]^{\mathrm T}$. The discrete-time system matrices $A_d$ and $B_d$ are derived from $A$ and $B$ (as described in [11]) using the zero-order hold method with sampling interval $T_s$, where $A_d = e^{A T_s}$ and $B_d = \int_0^{T_s} e^{A \tau}\,\mathrm{d}\tau\, B$ [19]. Thus, $A_d = \begin{bmatrix} 1 & T_s \\ 0 & 1 \end{bmatrix}$ and $B_d = \begin{bmatrix} T_s^{2}/2 \\ T_s \end{bmatrix}$.
The linearized system uses the state vector $z_i(k)$ to describe the voltage dynamics, where the bus voltage $y_i(k)$ and its derivative $\dot{y}_i(k)$ are captured.
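For illustration, a minimal sketch of the zero-order-hold discretization described above is given below, assuming the double-integrator form of $A$ and $B$ obtained from feedback linearization and a hypothetical sampling interval; SciPy's `cont2discrete` is used only to cross-check the closed-form $A_d$ and $B_d$.

```python
import numpy as np
from scipy.signal import cont2discrete

Ts = 1e-4                                   # hypothetical sampling interval (s)
A = np.array([[0.0, 1.0], [0.0, 0.0]])      # double integrator after feedback linearization
B = np.array([[0.0], [1.0]])

# Zero-order hold: Ad = exp(A*Ts), Bd = (integral_0^Ts exp(A*tau) dtau) * B
Ad, Bd, *_ = cont2discrete((A, B, np.eye(2), np.zeros((2, 1))), Ts, method='zoh')

# Cross-check against the closed forms Ad = [[1, Ts], [0, 1]], Bd = [[Ts^2/2], [Ts]]
assert np.allclose(Ad, [[1.0, Ts], [0.0, 1.0]])
assert np.allclose(Bd, [[Ts**2 / 2], [Ts]])
```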
2.3. Reactive Power Sharing and Voltage Regulation
Under islanded conditions, each DG applies the conventional $Q$–$V$ droop control [20], i.e., $V_i^{*} = V_n - n_{Qi} Q_i$, where $V_i^{*}$ and $V_n$ are the voltage and voltage magnitude references, $n_{Qi}$ is the droop coefficient, and $Q_i$ is the reactive power output of the $i$th DG. The objective of reactive power sharing is [21]: $n_{Q1} Q_1 = n_{Q2} Q_2 = \cdots = n_{QN} Q_N$.
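A small numerical illustration of the $Q$–$V$ droop law and the sharing objective follows; the droop coefficients, reactive power outputs, and nominal voltage are hypothetical values chosen so that $n_{Qi} Q_i$ is equal across DGs.

```python
V_n = 311.0                               # nominal voltage magnitude reference (V), as in Section 4
n_Q = [1.0e-3, 2.0e-3, 1.0e-3, 2.0e-3]    # hypothetical droop coefficients n_Qi (V/var)
Q   = [800.0, 400.0, 800.0, 400.0]        # hypothetical reactive power outputs Q_i (var)

# Q–V droop: V_i* = V_n - n_Qi * Q_i
V_ref = [V_n - n * q for n, q in zip(n_Q, Q)]

# Accurate sharing requires n_Q1*Q1 = n_Q2*Q2 = ... = n_QN*QN
shares = [n * q for n, q in zip(n_Q, Q)]
print(V_ref, shares)   # here every n_Qi*Q_i equals 0.8, so sharing is proportional
```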
Following the conventional Kron reduction based on steady-state parameters [22], the reduced bus admittance matrix $Y = [Y_{ij}]$ is obtained. The reactive power at bus $i$ is given by [9]:
$$Q_i = \sum_{j \in \mathcal{N}_i} V_i V_j \left( G_{ij} \sin\theta_{ij} - B_{ij} \cos\theta_{ij} \right), \tag{3}$$
where $G_{ij}$ and $B_{ij}$ are the real and imaginary parts of $Y_{ij}$ between buses $i$ and $j$ ($Y_{ij} = G_{ij} + \mathrm{j} B_{ij}$), $\theta_{ij}$ is the phase angle difference, and $\mathcal{N}_i$ is the set of buses connected to $i$ (including $i$ itself). Assuming small power angles ($\sin\theta_{ij} \approx \theta_{ij}$, $\cos\theta_{ij} \approx 1$) [23] and predominantly inductive feeder impedance ($G_{ij} \approx 0$) [24], (3) simplifies to $Q_i \approx -\sum_{j \in \mathcal{N}_i} B_{ij} V_i V_j$. Combining this simplified expression with (2), the discrete-time dynamics of $Q_i$ are obtained as Equation (4).
Accurate reactive power sharing is difficult due to the coupling between voltage regulation and reactive power [10]. Tight voltage control limits reactive power exchange and leads to sharing imbalance. To overcome this, a containment control strategy is used to maintain voltage within set bounds. The boundary dynamics of the virtual leaders are given in (5), where $\overline{v}$ and $\underline{v}$ denote the upper and lower voltage reference limits. According to MAS theory, the neighborhood containment error $e_i(k)$ of the $i$th DG is given by:
$$e_i(k) = \sum_{j \in N_i} a_{ij}\left(z_i(k) - z_j(k)\right) + \sum_{r=1}^{M} g_{ir}\left(z_i(k) - z_r(k)\right), \tag{6}$$
where $z_r(k)$ denotes the state of the $r$th virtual leader.
Based on (2), (5), and (6), the dynamic equation satisfied by the containment voltage error $e_i(k)$ can be derived, as given in (7). The reactive-power-sharing error quantifies deviations in the reactive power contribution of DGs and is defined by
$$e_{Qi}(k) = \sum_{j \in N_i} a_{ij}\left(n_{Qi} Q_i(k) - n_{Qj} Q_j(k)\right). \tag{8}$$
According to (4), the dynamic equation of the reactive-power-sharing error is given in (9).
By minimizing these errors, the proposed control strategy achieves both containment-based voltage regulation and accurate reactive power sharing.
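The following sketch illustrates how each DG could evaluate its local containment error (6) and reactive-power-sharing error from neighbor data; the topology, pinning gains, voltage snapshot, and ±1% bounds are hypothetical, and scalar bus voltages are used in place of the full state vectors for simplicity.

```python
import numpy as np

def containment_error(i, v, A, G, v_leaders):
    """Neighborhood containment error of DG i, cf. (6):
    sum_j a_ij (v_i - v_j) + sum_r g_ir (v_i - v_r)."""
    e = sum(A[i, j] * (v[i] - v[j]) for j in range(len(v)))
    e += sum(G[r][i, i] * (v[i] - v_leaders[r]) for r in range(len(v_leaders)))
    return e

def sharing_error(i, nQ, Q, A):
    """Reactive-power-sharing error of DG i: sum_j a_ij (n_Qi*Q_i - n_Qj*Q_j)."""
    return sum(A[i, j] * (nQ[i] * Q[i] - nQ[j] * Q[j]) for j in range(len(Q)))

# Hypothetical snapshot of bus voltages (V) and reactive powers (var)
v  = np.array([312.0, 309.5, 310.8, 311.5])
Q  = np.array([820.0, 390.0, 805.0, 410.0])
nQ = np.array([1e-3, 2e-3, 1e-3, 2e-3])
A  = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
G  = [np.diag([1.0, 0, 0, 0]), np.diag([0, 0, 0, 1.0])]   # hypothetical pinning of DG1 / DG4
v_leaders = [311.0 * 1.01, 311.0 * 0.99]                   # ±1% containment bounds around 311 V

print(containment_error(0, v, A, G, v_leaders), sharing_error(0, nQ, Q, A))
```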
Assumption A1. For any virtual leader, one or more paths exist that connect its dynamic behavior to every follower DG in the network.
2.4. Optimal Performance Metrics
Each DG $i$ optimizes its cost via a game, using a local performance index as in [25] to ensure proper power sharing, low energy consumption, and voltage security:
$$J_i = \sum_{k=0}^{\infty} \gamma^{k} \left[ e_i^{\mathrm T}(k) Q_{1i}\, e_i(k) + e_{Qi}^{\mathrm T}(k) Q_{2i}\, e_{Qi}(k) + u_i^{\mathrm T}(k) R_i\, u_i(k) \right], \tag{10}$$
where $e_i(k)$ and $e_{Qi}(k)$ are the containment voltage error and the reactive-power-sharing error, and $Q_{1i}$, $Q_{2i}$, $R_i$ are all positive weighting matrices. $\gamma$ denotes the discount factor, satisfying $0 < \gamma \le 1$. Each DG optimizes its control strategy locally through communication with neighbors to achieve system objectives.
Definition 1 ([25]). The control action is considered admissible if it stabilizes Equations (7) and (9) and ensures that the performance index $J_i$ remains bounded.
For any admissible control policy $u_i$, the local performance function $V_i$ of the $i$th DG can be expressed as $V_i(k) = r_i(k) + \gamma V_i(k+1)$ by applying the Bellman optimality principle, where $r_i(k) = e_i^{\mathrm T}(k) Q_{1i}\, e_i(k) + e_{Qi}^{\mathrm T}(k) Q_{2i}\, e_{Qi}(k) + u_i^{\mathrm T}(k) R_i\, u_i(k)$ is the stage cost. Specifically, the optimal local performance function is given by:
$$V_i^{*}(k) = \min_{u_i(k)} \left[ r_i(k) + \gamma V_i^{*}(k+1) \right], \tag{11}$$
where $V_i^{*}$ denotes the optimal value function, subject to the boundary condition $V_i^{*}(0) = 0$. Equation (11) represents the HJB equation. Accordingly, $u_i^{*}$ represents the local containment control input that achieves optimality for the microgrid, and its derivation is provided below:
$$u_i^{*}(k) = \arg\min_{u_i(k)} \left[ r_i(k) + \gamma V_i^{*}(k+1) \right]. \tag{12}$$
Remark 1. While conventional model-based methods can theoretically compute the optimal control by solving the HJB Equation (12), their practical application is hindered by a significant limitation: the reliance on precise microgrid parameters that are often unavailable or uncertain in real-world scenarios. To address this fundamental challenge, this paper proposes a model-free reinforcement learning approach. Instead of requiring an explicit system model, this method approximates the HJB solution and derives the optimal policy directly from input–output data using an actor–critic framework. This data-driven nature represents a key advantage, enhancing the controller’s robustness and practical value compared to conventional methods that depend on an idealized and often inaccurate system model.
3. Coordinated Voltage and Reactive Power Control Scheme Design Based on DLE and RL Algorithms
This section describes the coordinated control scheme for voltage and reactive power based on DLE and RL algorithms.
Figure 2 shows the overall control procedure.
3.1. Convergence Analysis of Policy Iteration
The iterative learning algorithm is applied to the containment controller as an optimization method using historical data. Each DG exchanges voltage and reactive power information with others via the communication network. A time sequence $\{kT_s\}$ with interval $T_s$ is defined. In policy iteration, the performance function $V_i^{(s)}$ is evaluated for a feasible policy $u_i^{(s)}$ and, as $s$ increases, both the performance function and the control policy are iteratively updated.
Step 1: Initialize an admissible control policy $u_i^{(0)}$ and set $s = 0$;
Step 2: Update the performance function by policy evaluation: $V_i^{(s)}(k) = r_i(k)\big|_{u_i = u_i^{(s)}} + \gamma V_i^{(s)}(k+1)$;
Step 3: Update the control actions by policy improvement: $u_i^{(s+1)}(k) = \arg\min_{u_i(k)} \left[ r_i(k) + \gamma V_i^{(s)}(k+1) \right]$;
Step 4: The algorithm terminates when $\| V_i^{(s+1)} - V_i^{(s)} \| \le \epsilon$, where $\epsilon$ is a predefined positive constant; otherwise, the iteration index $s$ is updated to $s + 1$, and the process returns to Step 2 for further iteration.
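As an illustration of the evaluate–improve–terminate loop in Steps 1–4, the sketch below runs policy iteration on a simplified discounted linear-quadratic problem (containment error state only, hypothetical weights and sampling interval); it is not the model-free implementation of Section 3.4, since policy evaluation here uses the assumed $A_d$, $B_d$.

```python
import numpy as np

# Hypothetical discrete-time error dynamics z(k+1) = Ad z(k) + Bd u(k) and weights
Ts = 1e-4
Ad = np.array([[1.0, Ts], [0.0, 1.0]])
Bd = np.array([[Ts**2 / 2], [Ts]])
Q, R, gamma = np.eye(2), np.array([[0.01]]), 0.95

K = np.zeros((1, 2))   # Step 1: initial policy u = -K z (admissible, since the discounted evaluation converges)
for s in range(50):
    # Step 2 (policy evaluation): P solves P = Q + K'RK + gamma*(Ad-Bd K)' P (Ad-Bd K)
    Acl = Ad - Bd @ K
    P = np.eye(2)
    for _ in range(2000):
        P = Q + K.T @ R @ K + gamma * Acl.T @ P @ Acl
    # Step 3 (policy improvement): K <- gamma*(R + gamma*Bd'PBd)^{-1} Bd'PAd
    K_new = gamma * np.linalg.solve(R + gamma * Bd.T @ P @ Bd, Bd.T @ P @ Ad)
    # Step 4: terminate once the policy change is below a small threshold
    if np.linalg.norm(K_new - K) < 1e-8 * (1.0 + np.linalg.norm(K)):
        K = K_new
        break
    K = K_new
print("iterations:", s + 1, "converged gain K =", K)
```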
The objective is to guarantee convergence of both the control strategy and the local performance function to their respective optimal values. To establish $V_i^{(s)} \to V_i^{*}$ and $u_i^{(s)} \to u_i^{*}$ as $s \to \infty$, an essential lemma is presented below.
Lemma 1 ([26]). Starting from any initial admissible control policy, $V_i^{(s)}$ and $u_i^{(s)}$ are updated iteratively via Steps 2 and 3. It can be shown that $V_i^{(s)}$ is monotonically nonincreasing, i.e., $V_i^{(s+1)} \le V_i^{(s)}$.
Theorem 1. Let $V_i^{(s)}$ and $u_i^{(s)}$ be generated by Step 2 and Step 3. As $s \to \infty$, $V_i^{(s)}$ converges to the optimal value $V_i^{*}$, and $u_i^{(s)}$ converges to the optimum $u_i^{*}$, i.e.,
$$\lim_{s \to \infty} V_i^{(s)} = V_i^{*}, \qquad \lim_{s \to \infty} u_i^{(s)} = u_i^{*}. \tag{13}$$
Proof. Let $V_i^{(s)}$ denote the value function at iteration $s$, and define its pointwise limit as $V_i^{(\infty)} = \lim_{s \to \infty} V_i^{(s)}$. By Step 2 and Step 3, for all $s$, the recursion in (14) holds.
First, we note that for any $\epsilon > 0$, there exists an integer $s_0$ such that for all $s \ge s_0$, inequality (15) holds. For any admissible control policy, as $s \to \infty$, inequality (16) follows. Combining inequalities (15) and (16), for any $\epsilon > 0$, we obtain (17). Since $\epsilon$ is arbitrary, by letting $\epsilon \to 0$, we conclude that (18) holds.
For any admissible control policy, a new performance index can be used to equivalently express the problem, as given in (19). Furthermore, assume there exists a state for which the limit function $V_i^{(\infty)}$ exceeds the optimal value $V_i^{*}$. By recursively unfolding the definition of the performance index, for a finite horizon $N$ (and noting that the terminal cost vanishes as $N \to \infty$), we obtain (21). According to the definition of $V_i^{*}$, (22) holds. By the principle of optimality, $V_i^{*}$ is the minimal cost. Therefore, (23) holds, which contradicts our previous assumption. Thus, it must hold that $V_i^{(\infty)}(k) \le V_i^{*}(k)$ for all $k$. Moreover, $V_i^{*}$ serves as a global lower bound on the cost for any admissible policy, with equality attained under the optimal policy, i.e., (24).
Similarly, this bound holds for any iteration $s$. Taking the limit as $s \to \infty$, we have $V_i^{(\infty)} \le V_i^{*}$. On the other hand, by the definition of $V_i^{*}$ as the minimal cost achievable by any admissible policy, $V_i^{(\infty)}$ cannot be smaller than $V_i^{*}$. Therefore, the following equality holds: $V_i^{(\infty)} = V_i^{*}$.
This completes the proof. □
This algorithm ensures voltage convergence to the optimal values under containment control and achieves accurate reactive power sharing.
3.2. Stability Analysis of Coordinated Voltage and Reactive Power Control
Section 3.1 proved that the policy iteration algorithm converges to the optimal control policy $u_i^{*}$. This section demonstrates that the application of this optimal policy ensures the asymptotic stability of the closed-loop system. Specifically, we prove that the containment voltage error $e_i(k)$ and the reactive-power-sharing error $e_{Qi}(k)$ converge to zero.
To analyze the stability, we employ a Lyapunov-based approach. The optimal value function derived from the Bellman equation serves as a natural candidate for a Lyapunov function for the closed-loop error dynamics of the ith DG.
Theorem 2. For the error dynamics described by (7) and (9), if the control policy is obtained from the converged policy iteration algorithm as described in Section 3.1, then the closed-loop system is asymptotically stable at the origin, i.e., $\lim_{k \to \infty} e_i(k) = 0$ and $\lim_{k \to \infty} e_{Qi}(k) = 0$.
Proof. The optimal value function $V_i^{*}$ satisfies the Bellman optimality equation for the optimal policy $u_i^{*}$:
$$V_i^{*}(k) = \sum_{n=k}^{\infty} \gamma^{\,n-k}\, r_i(n), \tag{25}$$
where $r_i(n) = e_i^{\mathrm T}(n) Q_{1i}\, e_i(n) + e_{Qi}^{\mathrm T}(n) Q_{2i}\, e_{Qi}(n) + u_i^{*\mathrm T}(n) R_i\, u_i^{*}(n)$ is the stage cost under $u_i^{*}$.
According to Definition 1, for any admissible control policy, the performance index $J_i$ must be bounded. The optimal policy $u_i^{*}$ is, by definition, an admissible policy. Therefore, the optimal value function $V_i^{*}$, which is the minimum possible value of $J_i$, must be finite.
From the definition of $r_i(k)$, since the weighting matrices $Q_{1i}$, $Q_{2i}$, and $R_i$ are all positive definite, $r_i(k) \ge 0$ for all $k$. The equality holds if and only if $e_i(k) = 0$, $e_{Qi}(k) = 0$, and $u_i^{*}(k) = 0$.
For the infinite series in (25) to converge to a finite value with a discount factor $\gamma$, it is a necessary condition that the terms of the series approach zero, that is:
$$\lim_{n \to \infty} \gamma^{\,n-k}\, r_i(n) = 0.$$
Since $\gamma$ is a positive constant, this implies:
$$\lim_{n \to \infty} r_i(n) = 0.$$
Given that $r_i(n)$ is a sum of non-negative terms, for their sum to be zero, each individual term must be zero. Therefore, we must have:
$$\lim_{n \to \infty} e_i^{\mathrm T}(n) Q_{1i}\, e_i(n) = 0, \qquad \lim_{n \to \infty} e_{Qi}^{\mathrm T}(n) Q_{2i}\, e_{Qi}(n) = 0.$$
Since $Q_{1i}$ and $Q_{2i}$ are positive definite matrices, this directly leads to the conclusion that the error states converge to zero:
$$\lim_{n \to \infty} e_i(n) = 0, \qquad \lim_{n \to \infty} e_{Qi}(n) = 0.$$
This demonstrates that the origin of the error system is asymptotically stable under the optimal control policy $u_i^{*}$. □
3.3. Dynamic Leader Election Algorithm
Containment control maintains voltage safety in microgrids by enforcing upper and lower bounds. However, conventional approaches usually predefine the upper-bound leader. If this leader experiences a heavy load, its voltage may decrease, violating the highest-voltage assumption and impairing effective reactive power sharing.
To dynamically adjust the containment control leader based on bus voltage, each DG must access the voltage of all DG-connected buses. However, due to the distributed communication architecture in microgrids, non-adjacent DGs cannot directly share information. Thus, a bus voltage estimation algorithm is required to enable indirect acquisition of voltage data among non-adjacent buses.
Let $\tilde{v}_i = [\tilde{v}_{i1}, \tilde{v}_{i2}, \ldots, \tilde{v}_{iN}]^{\mathrm T}$ denote the vector of bus voltage estimates by the $i$th DG, where $\tilde{v}_{ij}$ represents the $i$th DG’s estimate of the voltage at the bus to which the $j$th DG is connected; furthermore, $\tilde{v}_{ii} = v_i$. Then, the update rule for the estimated value $\tilde{v}_{ij}$ takes the form
$$\dot{\tilde{v}}_{ij} = \sum_{k \in N_i} a_{ik}\left(\tilde{v}_{kj} - \tilde{v}_{ij}\right) + a_{ij}\left(v_j - \tilde{v}_{ij}\right), \tag{27}$$
where $a_{ik}$ specifies the $(i,k)$ entry in the adjacency matrix. The first term, $\sum_{k \in N_i} a_{ik}(\tilde{v}_{kj} - \tilde{v}_{ij})$, represents the difference between the $i$th DG’s estimate and its neighboring $k$th DG’s estimate of the bus voltage at the $j$th DG. The second term, $a_{ij}(v_j - \tilde{v}_{ij})$, captures the error between the $i$th DG’s estimate and the actual bus voltage at the $j$th DG.
During each iteration of the microgrid controller, each DG estimates the voltages at all buses according to (27). Based on the estimated vector $\tilde{v}_i$, the $i$th DG determines whether it is elected as an upper- or lower-bound leader. The specific rules are given as follows:
If $\tilde{v}_{ii}$ is the maximum value in $\tilde{v}_i$, then node $i$ is selected as the upper-bound leader.
If $\tilde{v}_{ii}$ is the minimum value in $\tilde{v}_i$, then node $i$ is selected as the lower-bound leader.
If $\tilde{v}_{ii}$ is neither the maximum nor the minimum value in $\tilde{v}_i$, node $i$ is not selected as a leader.
If there are multiple nodes whose estimated values are equal to $\tilde{v}_{ii}$ and these values are the maximum or minimum in $\tilde{v}_i$, node $i$ is elected as the upper- or lower-bound leader only if its index $i$ is the smallest; otherwise, it is not selected as a leader.
Other nodes follow the same procedure to determine their leadership status. In the following, we prove the convergence of the bus voltage estimation algorithm.
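Before the proof, a minimal sketch of the estimator (27) (integrated with forward Euler) and the election rules is given below; the communication topology, bus voltages, step size, and iteration count are all hypothetical.

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)            # hypothetical communication adjacency matrix
v = np.array([310.2, 308.9, 311.6, 309.8])     # hypothetical actual bus voltages (V)
N, dt = len(v), 0.05                           # forward-Euler step for the estimator ODE (27)

v_est = np.zeros((N, N))                       # v_est[i, j]: DG i's estimate of DG j's bus voltage
v_est[np.arange(N), np.arange(N)] = v          # each DG knows its own bus voltage exactly

for _ in range(2000):                          # run the distributed estimator until (near) convergence
    dv = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            consensus  = sum(A[i, k] * (v_est[k, j] - v_est[i, j]) for k in range(N))
            correction = A[i, j] * (v[j] - v_est[i, j])
            dv[i, j] = consensus + correction
    v_est += dt * dv
    v_est[np.arange(N), np.arange(N)] = v      # re-impose the constraint v_est[i, i] = v_i

def elect(i, est_row):
    """Election rules evaluated locally by DG i from its own estimate vector (ties -> smallest index)."""
    if est_row[i] == est_row.max() and i == int(np.flatnonzero(est_row == est_row.max())[0]):
        return "upper-bound leader"
    if est_row[i] == est_row.min() and i == int(np.flatnonzero(est_row == est_row.min())[0]):
        return "lower-bound leader"
    return "follower"

for i in range(N):
    print(f"DG{i + 1}: {elect(i, v_est[i])}")
```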
Theorem 3. Consider the estimator dynamic for bus voltage in a microgrid, given by (27). Under Assumption 1, it holds that $\tilde{v}_{ij} \to v_j$ for all $i, j$.
Proof. Introduce the error variable $\varepsilon_{ij}$, defined by $\varepsilon_{ij} = \tilde{v}_{ij} - v_j$, where $\varepsilon_{ij}$ represents the estimation error of the bus voltage at the $j$th DG, as estimated by the $i$th DG. Since $v_j$ is constant or varies very slowly in steady state, it follows that $\dot{\varepsilon}_{ij} = \dot{\tilde{v}}_{ij}$.
Substituting the system dynamics yields:
$$\dot{\varepsilon}_{ij} = \sum_{k \in N_i} a_{ik}\left(\tilde{v}_{kj} - \tilde{v}_{ij}\right) + a_{ij}\left(v_j - \tilde{v}_{ij}\right). \tag{28}$$
By substituting the relation $\tilde{v}_{ij} = \varepsilon_{ij} + v_j$ into the above equation, the error dynamics can be expressed as
$$\dot{\varepsilon}_{ij} = \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right) - a_{ij}\,\varepsilon_{ij}. \tag{29}$$
Next, define the Lyapunov function as
$$W = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij}^{2}, \tag{30}$$
which satisfies $W \ge 0$ for all $t$, and $W$ equals zero only when $\varepsilon_{ij}$ is zero for all $i, j$.
By differentiating (30) with respect to time, we obtain
$$\dot{W} = \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij}\,\dot{\varepsilon}_{ij} = \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij} \left[ \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right) - a_{ij}\,\varepsilon_{ij} \right]. \tag{31}$$
To facilitate further analysis, we separate the right-hand side of (31) into two components and define
$$W_1 = \sum_{i=1}^{N} \sum_{j=1}^{N} \varepsilon_{ij} \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right), \qquad W_2 = -\sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij}\,\varepsilon_{ij}^{2}. \tag{32}$$
Noting that $a_{ik} = a_{ki}$, we exchange the indices $i$ and $k$ in $W_1$ to obtain an equivalent form. Summing the two expressions yields
$$W_1 = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right)^{2}, \tag{33}$$
which implies that
$$\dot{W} = W_1 + W_2 = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k \in N_i} a_{ik}\left(\varepsilon_{kj} - \varepsilon_{ij}\right)^{2} - \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij}\,\varepsilon_{ij}^{2} \le 0.$$
Since every term in the above sums is nonnegative and $\dot{W} \le 0$, we also have $W(t) \le W(0)$. Therefore, $W$ is nonincreasing and bounded below, which shows that $\dot{W} \to 0$ as $t \to \infty$, and consequently, $\varepsilon_{ij} \to 0$. Therefore, the $i$th DG’s estimation error for the $j$th DG’s bus voltage gradually converges to zero.
This completes the proof. □
Remark 2. In direct contrast to conventional containment control methods based on fixed leaders, as proposed in [10,11], this work addresses their well-known limitation in handling complex scenarios like sudden large load changes. The fixed-leader approach often fails to ensure accurate reactive power sharing under such conditions. To overcome this specific flaw, this paper introduces a novel DLE algorithm. This mechanism, based on bus voltage estimation, allows each DG to dynamically select the leader according to real-time operating conditions. By doing so, it enables accurate reactive power sharing precisely where the conventional method falters, providing a clear, practical demonstration of its superiority over the static, fixed-leader approach.
3.4. RL-Based Containment Control Implementation
To ensure voltage containment and accurate reactive power sharing, this section proposes a control method based on actor–critic reinforcement learning. The actor network generates the control policy, while the critic network evaluates and guides its optimization. Through online iteration, the algorithm converges to the optimal control. The implementation structure is shown in
Figure 2.
3.4.1. Critic Network
The critic network is designed to approximate the optimal value function, expressed as $\hat{V}_i(k)$. This network adopts a three-layer back-propagation neural architecture. Define the input vector of the critic as $X_i(k) = [\,e_i^{\mathrm T}(k) \;\; e_{Qi}^{\mathrm T}(k)\,]^{\mathrm T}$, let $l_c$ denote the quantity of neurons within the hidden layer, and let $W_{c1,i}$ and $W_{c2,i}$ represent the weight matrices for the hidden and output layers, respectively. Accordingly, the hidden layer is supplied with $W_{c1,i} X_i(k)$ as its input. A hyperbolic tangent activation function, $\tanh(\cdot)$, is employed in the hidden layer to capture smooth nonlinear relationships, with $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. The corresponding hidden layer output is $\phi_{c,i}(k) = \tanh\left(W_{c1,i} X_i(k)\right)$. Ultimately, the output of the critic network at time $k$ is represented by $\hat{V}_i(k) = W_{c2,i}^{\mathrm T}\, \phi_{c,i}(k)$.
The error term for the critic network is defined by $e_{c,i}(k) = \hat{V}_i(k) - \left[ r_i(k) + \gamma \hat{V}_i(k+1) \right]$. To train the critic network, gradient descent is employed to minimize the error $e_{c,i}(k)$, resulting in the following objective function: $E_{c,i}(k) = \tfrac{1}{2}\, e_{c,i}^{2}(k)$.
The iterative update laws for the weights $W_{c1,i}$ and $W_{c2,i}$ are given by
$$W_{c1,i}^{\,l+1} = W_{c1,i}^{\,l} - \alpha_c \frac{\partial E_{c,i}(k)}{\partial W_{c1,i}^{\,l}},$$
where $l$ is the neural network iteration index, and $\alpha_c$ represents the learning rate. For the output layer weights:
$$W_{c2,i}^{\,l+1} = W_{c2,i}^{\,l} - \alpha_c \frac{\partial E_{c,i}(k)}{\partial W_{c2,i}^{\,l}}.$$
3.4.2. Actor Network
The actor network is constructed to approximate the optimal control policy $u_i^{*}(k)$. It is implemented as a three-layer neural network, where the hidden layer consists of $l_a$ neurons and employs the hyperbolic tangent activation function. Denote $W_{a1,i}$ and $W_{a2,i}$ as the weight matrices for the hidden and output layers, respectively. The hidden layer output can be expressed as $\phi_{a,i}(k) = \tanh\left(W_{a1,i} X_i(k)\right)$, where $X_i(k)$ is the same input vector as that of the critic. The final output of the network is given by $\hat{u}_i(k) = W_{a2,i}^{\mathrm T}\, \phi_{a,i}(k)$.
By continuously adjusting the parameters $W_{a1,i}$ and $W_{a2,i}$, the network aims to derive the optimal control input based on $X_i(k)$. The parameter update is guided by minimizing the objective function $E_{a,i}(k) = \tfrac{1}{2}\, e_{a,i}^{\mathrm T}(k)\, e_{a,i}(k)$, where the error term is defined by $e_{a,i}(k) = \hat{V}_i(k)$, i.e., the critic’s evaluation of the current action, whose desired value is zero.
The purpose of the optimization problem is to ensure that the actor network produces an optimal control action that minimizes the value function. Similar to the critic network, the weights $W_{a1,i}$ and $W_{a2,i}$ are updated using gradient descent:
$$W_{a1,i}^{\,l+1} = W_{a1,i}^{\,l} - \alpha_a \frac{\partial E_{a,i}(k)}{\partial W_{a1,i}^{\,l}}.$$
Similarly, the weights of the output layer are updated by
$$W_{a2,i}^{\,l+1} = W_{a2,i}^{\,l} - \alpha_a \frac{\partial E_{a,i}(k)}{\partial W_{a2,i}^{\,l}},$$
where $\alpha_a$ represents the learning rate of the actor network.
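A minimal sketch of one critic/actor adaptation step is given below. For the actor update to be computable without a system model, the sketch feeds the action into the critic (an action-dependent critic), which is a simplification of the scheme described above; the network sizes, learning rates, cost weights, and sampled data are hypothetical, and gradients are taken numerically instead of by back-propagation for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u, n_h, gamma = 4, 1, 5, 0.95             # error-state dim, control dim, hidden neurons, discount
alpha_c, alpha_a = 1e-2, 1e-3                    # hypothetical learning rates

# Action-dependent critic Q(x,u) = Wc2^T tanh(Wc1 [x;u]); actor u(x) = Wa2^T tanh(Wa1 x)
Wc1, Wc2 = rng.normal(0, 0.1, (n_h, n_x + n_u)), rng.normal(0, 0.1, (n_h, 1))
Wa1, Wa2 = rng.normal(0, 0.1, (n_h, n_x)), rng.normal(0, 0.1, (n_h, n_u))

Qfun = lambda x, u, W1, W2: (W2.T @ np.tanh(W1 @ np.concatenate([x, u]))).item()
pi   = lambda x, W1, W2: (W2.T @ np.tanh(W1 @ x)).ravel()

def stage_cost(x, u):
    # r(k) = e^T Q1 e + e_Q^T Q2 e_Q + u^T R u, with assumed identity / 0.01 weights
    return float(x @ x + 0.01 * (u @ u))

def num_grad(f, W, eps=1e-6):
    """Numerical gradient of the scalar objective f with respect to weight matrix W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps; Wm[idx] -= eps
        g[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

# One online adaptation step from measured data (x_k, u_k, x_k1); the samples are hypothetical
x_k  = rng.normal(0.0, 1.0, n_x)
u_k  = pi(x_k, Wa1, Wa2)
x_k1 = 0.9 * x_k                                 # stands in for the next measured error state

# Critic step: gradient descent on E_c = 0.5*(Q(x_k,u_k) - [r_k + gamma*Q(x_k1,u_k1)])^2
u_k1   = pi(x_k1, Wa1, Wa2)
target = stage_cost(x_k, u_k) + gamma * Qfun(x_k1, u_k1, Wc1, Wc2)
Ec = lambda W1, W2: 0.5 * (Qfun(x_k, u_k, W1, W2) - target) ** 2
Wc1 -= alpha_c * num_grad(lambda W: Ec(W, Wc2), Wc1)
Wc2 -= alpha_c * num_grad(lambda W: Ec(Wc1, W), Wc2)

# Actor step: gradient descent on E_a = 0.5*Q(x_k, u(x_k))^2, pushing the evaluated cost toward zero
Ea = lambda W1, W2: 0.5 * Qfun(x_k, pi(x_k, W1, W2), Wc1, Wc2) ** 2
Wa1 -= alpha_a * num_grad(lambda W: Ea(W, Wa2), Wa1)
Wa2 -= alpha_a * num_grad(lambda W: Ea(Wa1, W), Wa2)
```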
Remark 3. As typical nonlinear systems, island microgrids present challenges for model-based controller design due to difficulties in obtaining practical parameters such as resistance and inductance [17,18]. These methods are also sensitive to measurement errors, further complicating controller design. In contrast to [18], this paper proposes a data-driven online reinforcement learning approach that does not require extensive offline data processing. The control policy is iteratively optimized by minimizing the value function, enabling accurate reactive power sharing and effective voltage control in the microgrid.
Remark 4. From a practical standpoint, the proposed DLE and model-free RL framework is designed to address key operational challenges in real-world offshore microgrids. First, the DLE algorithm provides crucial operational flexibility and resilience. It enables the microgrid to autonomously adapt to the harsh and dynamic conditions of offshore environments (e.g., sudden load changes, volatile renewables), overcoming the rigidity of fixed-leader schemes to ensure stability without manual intervention. Second, the model-free RL controller eliminates the reliance on an accurate system model, which is a significant practical challenge because obtaining line parameters is often both difficult and costly. By learning directly from measurement data, our approach simplifies deployment, reduces commissioning costs, and enhances robustness against parameter uncertainties and system aging. Collectively, these features make the proposed framework not only technically effective but also practical, cost-efficient, and resilient for real-world offshore applications.
4. Simulation Results
As shown in Figure 3, the offshore island AC microgrid system consisted of four renewable generation units. The algorithm proposed in this paper was validated on a simulation model built on the Simulink platform. The validation strategy across the following cases was deliberately designed to highlight the practical value of the proposed model-free approach. Instead of a direct numerical comparison against a model-based controller, which can be misleading (as its performance is entirely dependent on an idealized, perfectly accurate model that is unavailable in reality), our validation focused on two key aspects. First, in Case 4.1, we conducted a head-to-head comparison with a conventional fixed-leader method [10] to demonstrate that our DLE algorithm solved a fundamental operational flaw. Second, in Cases 4.2 and 4.3, we verified that our model-free controller robustly achieved all control objectives under challenging conditions (load changes and plug-and-play), thereby proving its effectiveness and practical viability on its own terms. Following the approach in [10], the allowable voltage deviation was set to ±1%, and the rated bus voltage $V_n$ was selected as 311 V, which served as the system design objective. Specific simulation parameters are provided in Table 1, and other related parameters were as follows: the discount factor $\gamma$ was selected as described in Section 2.4; in the performance index, both $Q_{1i}$ and $Q_{2i}$ were diagonal matrices with diagonal elements equal to 1, and the control weighting matrix $R_i$ was set accordingly. Both the actor and critic networks employed five hidden neurons.
4.1. Dynamic Leader Election
This case was designed to validate the effectiveness of the proposed dynamic leader election (DLE) algorithm. To achieve this, we first established a benchmark scenario over the initial time interval by implementing the fixed-leader containment control described in [10]. The purpose of this benchmark was to replicate a well-known limitation of conventional methods. As shown in Figure 4, during this benchmark period, while the system successfully maintained voltage containment (i.e., all bus voltages remained within the safe range), the reactive-power-sharing ratio failed to achieve the desired value. This outcome was an expected consequence of the fixed-leader topology: since DG1 was designated as the upper-limit leader, its bus voltage was consistently maintained at the highest level, which inherently restricted reactive power flow and prevented equitable sharing among the DGs. This scenario effectively highlighted the specific problem that the data-driven model-free approach was designed to overcome without relying on pre-configured roles or precise system parameters. Subsequently, at the first load-change instant, the load distribution was changed by transferring the load on Bus 3 to Bus 2. With the leaders still fixed as DG1 and DG4, it can be observed from Figure 4 that although the microgrid remained effective in voltage containment control, reactive power sharing was still not fully achieved.
The proposed dynamic leader election algorithm was then enabled. As shown in Figure 4, the dynamic leader election algorithm allowed the upper-limit leader to be automatically elected as DG4 and the lower-limit leader to be automatically elected as DG1. Under the effect of the dynamic leader election algorithm, the microgrid not only achieved voltage containment control but also brought the reactive-power-sharing ratio to the desired value, successfully achieving precise reactive power sharing. Subsequently, the load on Bus 2 was transferred back to Bus 3. From Figure 4, it can be observed that after a brief transient process, the microgrid once again achieved voltage containment control, and the reactive-power-sharing ratio was restored to the desired value, ensuring precise reactive power sharing. The results confirm that the proposed algorithm adaptively selects leaders based on bus voltage magnitude.
4.2. Load Variation
Through simulation and comparative experiments, this study verified the effectiveness of the proposed dynamic leader election algorithm and model-free reinforcement learning algorithm. The proposed approach achieved both voltage recovery and accurate reactive power sharing. First, during the initial time interval, the microgrid employed only the conventional PI control strategy described in [7], as shown in Figure 5. During that phase, the loads on Bus 1 and Bus 3 of the microgrid were kept at their initial values. It can be seen that the microgrid voltage was not restored to the safe level, nor was the reactive-power-sharing ratio precisely maintained at the desired value.
The proposed dynamic leader election and model-free reinforcement learning algorithms were then enabled. As shown in Figure 5, the voltage quickly recovered to within the safe constraint range, and the reactive-power-sharing ratio also reached the desired value, achieving accurate reactive power sharing. To further validate the robustness of the algorithms under load variation, the load on Bus 3 (DG3) was subsequently increased. From Figure 5, it can be observed that after experiencing a brief transient process, the microgrid voltage returned to the steady state and remained within the safe range. Simultaneously, the reactive-power-sharing ratio again reached the desired value, achieving precise reactive power sharing.
Figure 6 shows the evolution of actor–critic neural network weights for DG1 during the simulation. All weights converged to stable values, as illustrated.
4.3. Plug-And-Play Capability
The plug-and-play capability of microgrids enables the rapid integration or removal of DGs, allowing the system to adapt to load changes and equipment failures, thereby improving overall flexibility and scalability. To comprehensively and realistically validate the plug-and-play performance of the proposed algorithm, this section designs an experiment that includes a “plug-out” event and a “plug-in” process that mimics real-world engineering scenarios.
The simulation results are shown in Figure 7. During the initial period, the microgrid operated stably with all four DGs, and the proposed algorithms achieved voltage containment control and accurate reactive power sharing at the desired ratio. DG4 was then disconnected (plug-out) to simulate its removal from operation. The simulation results show that after DG4 was removed, the power deficit was automatically compensated by the remaining DGs. The system voltage, after a brief transient, quickly stabilized and remained within the safe constraint range. Meanwhile, reactive power was redistributed among the three remaining DGs, reaching a new stable sharing ratio. To simulate the reconnection process of a DG, DG4 subsequently initiated the synchronization process with the microgrid. During that phase, DG4 adjusted its output voltage frequency, phase, and amplitude to match the microgrid’s parameters in preparation for grid connection. Upon successful synchronization, DG4 was physically connected to the microgrid, and its controller was activated. As observed in Figure 7, the system seamlessly reintegrated DG4. The voltage remained stable, and the reactive-power-sharing ratio, after a short dynamic adjustment, was accurately restored to its initial state.
This complete test, encompassing both a plug-out event and a realistic plug-in process, robustly demonstrates that the proposed dynamic leader election and model-free reinforcement learning algorithms provide the microgrid with plug-and-play capability, ensuring safe and stable operation under dynamic topological changes.
5. Conclusions
This paper developed a secondary control method for offshore island microgrids based on a model-free reinforcement learning algorithm and a dynamic leader election mechanism. First, by combining the microgrid’s voltage containment error and reactive power sharing error, a value function for policy iteration was constructed. Then, a dynamic leader election algorithm was designed, enabling different DGs to be dynamically elected as leaders to facilitate accurate reactive power allocation. Subsequently, a model-free reinforcement learning algorithm was developed, which relied solely on real-time measurements of voltage and reactive power without requiring a complex system model.
However, it is important to acknowledge that this study was conducted under the assumption of an ideal island microgrid model, where factors such as communication delays, external disturbances, and potential cyber-attacks were not considered. Communication delays, which are inherent in distributed control systems, could introduce time lags in the information exchange among DGs. This might affect the timeliness of the dynamic leader election process and degrade the performance of the reinforcement learning algorithm, potentially leading to oscillations or even instability. Similarly, other disturbances, such as measurement noise and unmodeled dynamics, could impact the accuracy of the data-driven RL algorithm, which is highly dependent on the quality of measurement data. Addressing these practical challenges is crucial for real-world implementation. Therefore, these aspects will be the focus of our future work. We plan to investigate and develop more robust control strategies that can tolerate communication delays and are resilient to various disturbances. This may involve integrating predictive control mechanisms or designing delay-compensation techniques within the RL framework. To validate the effectiveness and robustness of the enhanced methods, we intend to conduct more comprehensive hardware-in-the-loop simulations or tests on a physical experimental platform.