Learning-Based Model Predictive Control of DC-DC Buck Converters in DC Microgrids: A Multi-Agent Deep Reinforcement Learning Approach

Abstract: This paper proposes a learning-based finite control set model predictive control (FCS-MPC) scheme to improve the performance of DC-DC buck converters interfaced with constant power loads in a DC microgrid (DC-MG). An approach based on deep reinforcement learning (DRL) is presented to address one of the ongoing challenges in FCS-MPC of such converters, namely the optimal design of the weighting coefficients appearing in the FCS-MPC objective function for each converter. A deep deterministic policy gradient method is employed to learn the optimal weighting-coefficient design policy, and the DRL problem is formulated as a Markov decision process. A DRL agent is trained for each converter in the MG, and the weighting coefficients are obtained through reward computation from the interactions between the MG and the agents. The proposed strategy is wholly distributed, wherein agents exchange data with one another, implying a multi-agent DRL problem. The proposed control scheme offers several advantages, including independence of the converter control system from the operating-point conditions, plug-and-play capability, and robustness against MG uncertainties and unknown load dynamics.


Introduction
Microgrids (MGs) are groups of interconnected loads and distributed generations (DGs), usually interfaced to the grid through power converters, that reduce pollution and power transmission losses while offering flexibility in installation location. The MG is an important concept for future distribution systems and will be increasingly utilized in the integration of renewable energy, the fastest-growing energy source globally [1,2]. MGs can operate in both islanded and grid-connected modes [3]. The use of clean and sustainable energy resources, such as photovoltaic systems, batteries, and chargers, has created considerable interest in DC-MGs [4]. DC-MGs also have several benefits compared with their AC counterparts; for example, controlling reactive power or unbalanced electrical signals is not an issue in a DC-MG, although protection remains a challenging task [5]. The critical issue for AC islanded MGs is ensuring voltage and frequency stability when inverters are connected to power sources through lines and loads [3,6].
Power converters act as voltage-source interfaces between loads and different types of sources, sharing power according to the availability and capability of the energy sources [7]. The most typical interfaces employed in DC-MGs are DC-DC buck and boost converters. When such converters are tightly controlled, they behave as constant power loads (CPLs) [8]. CPLs exhibit negative incremental impedance, which may induce instability in the DC bus and, consequently, failure of the whole MG [9]. The impact of CPLs becomes more critical when the MG operates in islanded mode, owing to the decreased damping.
Various solutions have been proposed in previous studies to deal with this problem. For example, introducing virtual impedance loops into converter control systems offers a promising solution for increasing the precision of power sharing and damping oscillatory currents in DC-MGs [10]. There are diverse strategies for current and voltage control of DC-DC converters, including sliding mode control (SMC), fuzzy logic, proportional-integral (PI) control, model predictive control (MPC), and state-dependent Riccati equation control [11]. Linear controllers are the most straightforward methods for achieving voltage regulation in DC-MGs [12]. However, these methods evaluate the network's stability around only one equilibrium point [13,14]. An integrated CPL raises the degree of nonlinearity in a DC-MG; thus, traditional linear strategies become questionable and face stability restrictions. A nonlinear PI stabilization controller has been developed in [15] to ensure stability in DC-MGs, but it suffers from variable switching frequency, which affects converter efficiency [16]. The authors of [9] proposed a nonlinear SMC to develop a control law that guarantees a region larger than local stability while improving large-signal stability. The main drawback of SMC is that it is difficult to impose constraints or to control abstract quantities. To overcome these drawbacks, FCS-MPC has been identified as one of the most favorable controllers for power electronic applications, owing to its ability to handle multiple objectives and constraints in real time [17,18]. The performance of FCS-MPC, however, is deeply influenced by the weighting coefficients, whose tuning remains a challenge. In this regard, Ref. [8] employed an artificial neural network in offline mode for weighting-coefficient design in an uninterruptible power supply (UPS) system.
This method, however, demands a high number of calculations for the adaptation and training process, and the analyses conducted to identify the optimal weighting coefficients depend on the operating conditions, which may give rise to flawed performance of the control system.
Recently, model-free intelligent controllers, such as fuzzy logic and neural networks, have been developed to decrease sensitivity to modeling inaccuracy. The main characteristic of intelligent controllers is their model-free design, which enables them to manage model nonlinearity, complexity, and uncertainty in power electronic applications. Nevertheless, these methods are only suitable for a specific time interval, as they lack the capability to learn online [19]. With the rapid development of machine learning, reinforcement learning (RL)-based techniques have gained significant attention and have become a vital mechanism for developing intelligent networks. RL approaches have successfully solved complicated problems when integrated with deep neural networks, an approach called DRL [20]. As a DRL algorithm, the Deep Q Network (DQN) was developed to address the limitations of conventional Q-learning [21]. The DQN has been utilized in applications such as autonomous underwater vehicles [22], aerial robots [23], and quadrotor control [24]. Nevertheless, DQN estimates the value function over a discrete action set, limiting its use for problems with continuous actions. Hence, the deep deterministic policy gradient (DDPG) algorithm was formulated to address this challenge [24,25]. In [26], DRL is employed as a voltage controller for a DC-DC buck converter. In [27], the DRL method is investigated for optimizing the weighting coefficients of an FCS-MPC-controlled inverter in a UPS system. However, those studies address a single-agent RL problem, which may not suit multiple-converter systems such as DC-MGs.
Motivated by the previous discussion, this paper uses FCS-MPC to improve the voltage regulation of DC-DC buck converters in a DC-MG. To avoid the dependency of the converter control system on the operating conditions, the weighting coefficients appearing in the FCS-MPC objective function for each converter are regulated online via a distributed DRL algorithm. The DRL problem is solved by a DDPG algorithm in an actor-critic framework. A DRL agent is trained for each buck converter in the MG, and the weighting coefficients of the FCS-MPC are obtained through reward computation from the interactions between the MG and the agents. Under the proposed strategy, each agent is established at the local converter so that the optimal objective is reached simultaneously. Simulation results under different operating conditions illustrate the effectiveness of the proposed control scheme. Table 1 provides a taxonomy of existing publications in the area and compares previous studies with this work to highlight its main contributions. Table 1. Comparison of the contributions of this paper with the previous studies.

The columns of Table 1 are: Refs., Controller, PnP Capability, Robust, Adaptive, Multi DG Units, and CPL; the compared controllers include ANN-Backstepping [7], FCS-MPC, ANN-MPC [20], DDPG-PI [26], and RL [27].
The contributions of this paper can be summarized as follows:
• A learning-based FCS-MPC is proposed to regulate the output voltage of DG units in a DC-MG. A multi-agent DRL-based approach provides online and adaptive tuning of the weighting coefficients of the FCS-MPC.
• Unlike FCS-MPC with constant coefficients, which are typically designed for a specified operating condition, the proposed approach avoids the dependency of the converter control system on the operating conditions.
• The control design of converters usually presumes that the CPLs are ideal, while in practice CPLs are of unknown and/or time-varying character. Hence, the performance of the proposed controller is investigated against power changes in non-ideal CPLs.
• One of the critical issues in MGs is the plug-and-play (PnP) operation of DGs, owing to the inherently intermittent nature of renewable energy sources. To address this issue, the dynamic performance of the proposed controller is examined under PnP operation of the DG units.

Model of Microgrid
The diagram of a single-bus DC-MG with multiple DG units and loads is depicted in Figure 1. The buck converter of each DG is regulated through the duty ratio of an IGBT switch to keep the output voltage stable. The DGs are connected to a common DC bus through LC filters, and the DC-MG is assumed to operate in islanded mode. For DG unit $i$, the following differential equations represent the output voltage and current of the converter:
$$C_{ti}\,\frac{dV_{oi}}{dt} = I_{ti} - I_{Li}, \qquad L_{ti}\,\frac{dI_{ti}}{dt} = V_{ti} - V_{oi}$$
where $I_{Li}$ and $I_{ti}$ are the currents of the load and converter, respectively; $V_{ti}$ is the converter's output voltage; $V_{oi}$ represents the capacitor voltage; and $L_{ti}$ and $C_{ti}$ are the filter parameters.
It is assumed that the inherently switched dynamics of the buck converter have been averaged over time; this is a reasonable approximation for converters operating at high switching frequencies.
Defining the state vector $x_i = [V_{oi}\;\; I_{ti}]^T$ and the input $u_i = V_{ti}$, the output voltage and current of DG $i$ can be described in state-space form as
$$\dot{x}_i = A_{ii}x_i + B_i u_i + D_i I_{Li}$$
where, from the differential equations above,
$$A_{ii} = \begin{bmatrix} 0 & 1/C_{ti} \\ -1/L_{ti} & 0 \end{bmatrix},\quad B_i = \begin{bmatrix} 0 \\ 1/L_{ti} \end{bmatrix},\quad D_i = \begin{bmatrix} -1/C_{ti} \\ 0 \end{bmatrix}.$$
Continuing from (3), the overall model of the MG consisting of three DGs (i.e., DG$_i$, DG$_j$, and DG$_k$) stacks the individual state equations; owing to the neglect of the line dynamics, the coupling matrices $A_{ik}$, $A_{jk}$, $A_{ji}$, $A_{ki}$, $A_{ij}$, and $A_{kj}$ are zero. Figure 2 illustrates the DRL-based FCS-MPC-operated converter. In this approach, the control command is obtained from the prediction of the converter model and an objective function (OF). Equations (2) and (3) describe the continuous-time state-space model of a DC-DC buck converter. To obtain a discrete representation appropriate for a digital control system, this paper applies the zero-order hold (ZOH) discretization technique. The discrete state-space model with sampling time $T_s$ can be described as follows [19]:
$$x_i(k+1) = A_{di}x_i(k) + B_{di}u_i(k) + D_{di}I_{Li}(k), \qquad A_{di} = e^{A_{ii}T_s}.$$
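The ZOH discretization step can be sketched numerically. The following Python snippet (with illustrative filter values, not the paper's parameters) builds the continuous LC-filter model, with $V_t$ and $I_L$ treated as two inputs, and discretizes it with SciPy:

```python
import numpy as np
from scipy.signal import cont2discrete

# Continuous-time model of one converter stage, state x = [V_o, I_t],
# inputs u = [V_t, I_L] (filter values are illustrative, not the paper's).
L_t, C_t, Ts = 2e-3, 1e-3, 1e-5
A = np.array([[0.0, 1.0 / C_t],      # dV_o/dt = (I_t - I_L) / C_t
              [-1.0 / L_t, 0.0]])    # dI_t/dt = (V_t - V_o) / L_t
B = np.array([[0.0, -1.0 / C_t],     # columns: effect of V_t and of I_L
              [1.0 / L_t, 0.0]])
C = np.eye(2)
D = np.zeros((2, 2))

# Zero-order-hold discretization with sampling time Ts, as used for the
# FCS-MPC prediction model; Ad equals the matrix exponential expm(A * Ts).
Ad, Bd, Cd, Dd, _ = cont2discrete((A, B, C, D), Ts, method="zoh")
```

For the small $T_s$ chosen here, `Ad` is close to the forward-Euler approximation $I + A T_s$, and since the LC filter is lossless its discrete eigenvalues lie on the unit circle.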

Proposed Controller Design

These equations are used in the FCS-MPC prediction step. System modeling and OF design are the two main stages of FCS-MPC controller design. The switching signal determines the voltage vector and takes one of two states, zero or one. The FCS-MPC method is mainly used in digital control and relies on synchronized switching and sampling instants [28]. The main goal of the control system is to adjust the voltage $V_{ti}$ so that $V_{oi}$ follows the reference voltage precisely. The fundamental function of FCS-MPC is to predict the values of $V_{oi}$ and $I_{ti}$ and apply the optimal $V_{ti}$ based on an OF; the switching state yielding the minimum OF value is then applied to the converter. Therefore, determining an appropriate OF is a prominent part of the FCS-MPC approach.

OFs with multi-step prediction horizons have been proposed to enhance the steady-state performance of the control system and are typical of high-power multilevel converters, since longer prediction horizons enlarge the achievable performance region [29]. In contrast, a single-step prediction is usually a better choice for a converter with a high switching frequency: a single-step horizon requires less computation and remains flexible in integrating linear and nonlinear control objectives and constraints. This study uses an OF with a single-step prediction horizon. For DC voltage regulation of the converter, the single-step OF is expressed as:
$$g = \left(V_{ref} - V_{oi}(k+1)\right)^2$$
where $V_{ref}$ is the voltage reference.
An additional current-reference term is added to improve the steady-state performance, where $I_L$ is the load current; $V_{oi}(k+1)$ and $I_{ti}(k+1)$ are the predicted voltage and current; and $\omega_r = 2\pi f_r$ is the angular frequency. The $g_{con}$ and $g_c$ terms are multiplied by the weighting coefficients $\lambda_v$ and $\lambda_{der}$. In addition, a current-limiting term $h_{lim}$ and a switching-penalization term $sw$ are defined, where $|\Delta S(i)|$ is 1 if a switch change happens at instant $i$ and 0 otherwise. The terms expressed in (11)-(13) are added to (10), which eventually produces the modified OF (14). As can be seen, the system performance is highly influenced by the weighting coefficients $\lambda_v$, $\lambda_{der}$, and $\lambda_{sw}$, which should be adjusted optimally. In [19], these weighting coefficients were tuned offline; however, once the operating point varies significantly, securing a good response becomes challenging. In this paper, the DRL approach adjusts the weighting coefficients online and rapidly, thereby improving the performance of the converter control system. The design of the multi-agent DRL-based regulation scheme is discussed in the following section.
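To make the single-step FCS-MPC loop concrete, the sketch below enumerates the two switch states and picks the one minimizing a simplified OF with voltage-tracking, current-tracking, and switching-penalty terms. All numeric values, the fixed weighting coefficients, and the surrogate current reference `Iref` are illustrative assumptions, not the paper's design:

```python
import numpy as np

# Single-step FCS-MPC sketch for one buck converter (all numbers assumed).
Ts, L, C, Vdc = 1e-5, 2e-3, 1e-3, 400.0    # sampling time, filter L/C, input voltage
Vref, IL = 200.0, 10.0                     # voltage reference, load current
lam_v, lam_der, lam_sw = 1.0, 1.0, 0.1     # weighting coefficients (held fixed here)

# Continuous model, state x = [Vo, It]
A = np.array([[0.0, 1.0 / C], [-1.0 / L, 0.0]])
B = np.array([0.0, 1.0 / L])
D = np.array([-1.0 / C, 0.0])

def predict(x, Vt):
    """One forward-Euler prediction step (ZOH would be used in practice)."""
    return x + Ts * (A @ x + B * Vt + D * IL)

def fcs_mpc_step(x, S_prev):
    """Enumerate the finite control set {0, 1} and return the OF-minimizing state."""
    Iref = IL + 0.5 * (Vref - x[0])        # illustrative current-reference rule
    best_S, best_g = 0, np.inf
    for S in (0, 1):
        Vo_next, It_next = predict(x, S * Vdc)
        g = (lam_v * (Vref - Vo_next) ** 2
             + lam_der * (Iref - It_next) ** 2
             + lam_sw * abs(S - S_prev))   # penalize switch-state changes
        if g < best_g:
            best_S, best_g = S, g
    return best_S
```

With the voltage below the reference the controller turns the switch on, and with the voltage above the reference it turns the switch off; the DRL scheme described next replaces the fixed `lam_*` values with online-tuned ones.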

Multi-Agent DRL-Based Regulation Scheme
In this paper, the design of the weighting coefficients $\lambda_v$, $\lambda_{der}$, and $\lambda_{sw}$ is formulated in a multi-agent DRL framework. Each DG has its own controller, whose weighting coefficients are determined by a DRL agent in a distributed manner. In the distributed procedure, the DRL agent associated with each DG exchanges data with the others; thus, each DG uses a local reward function. The multi-agent DRL environment is a network model of a DC-MG including three DGs, and the DRL agents operate together to regulate the output voltage. A schematic of the proposed strategy is shown in Figure 3. The method includes an offline and an online learning phase. In the offline phase, the DRL agents follow a centralized learning process to explore the environment, and a reward function assesses the actions generated by the agents. By updating the critic and actor networks, each agent generates the optimal control command (updates the weighting coefficients) to improve the system performance. In the online phase, the agents act in a distributed framework to determine the weighting coefficients. Distributed control is one of the most desirable communication-based control techniques because it does not need a central controller. Each agent adjusts the FCS-MPC coefficients of its DG based on its observation $e_i$. Considering the communication network (the communication links transfer the measured data of each DG unit), the observation $e_i$ for each DG is the error between the average of the voltages broadcast by the DGs and the reference voltage:
$$e_i = V_{ref} - \frac{1}{N}\sum_{j=1}^{N} V_{oj}$$
where $N$ is the number of DGs in the MG. The DRL concept is to find the best policy over the states and actions while obtaining maximum reward through the interaction between the agent and the environment. The DRL problem is described as a Markov decision process (MDP).
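A minimal sketch of the observation computed by each agent follows, assuming the sign convention that $e_i$ is the reference voltage minus the average broadcast voltage:

```python
# Observation e_i for agent i: error between the reference voltage and the
# average of the output voltages broadcast by the N DG units (sign assumed).
def observation(voltages, v_ref):
    return v_ref - sum(voltages) / len(voltages)

e_i = observation([199.0, 200.0, 201.0], 200.0)   # balanced case: e_i = 0.0
```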
An MDP is defined by the 5-tuple $(S, A, P, R, \gamma)$, where $S$ is the state set, $A$ the action set, $P : S \times A \times S \rightarrow [0,1]$ the state-transition probability with $P = p(s_{t+1} \mid s_t, a_t)$, $R : S \times A \rightarrow \mathbb{R}$ the reward, and $\gamma \in [0,1]$ the discount factor. At any time step, the DRL agent monitors the state $s_t$ and selects an action $a_t$ according to the policy $\pi(a_t \mid s_t)$. The agent then observes the reward $r_t$ associated with this action and the resulting next state $s_{t+1}$. The discounted return with discount factor $\gamma$ is defined as [26]:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
The goal is to maximize the expected discounted return,
$$\max_{\pi} \mathbb{E}_{En}[G_t]$$
where $En$ denotes the environment and the actions $a_t$ are determined by the policy $\pi$. In most DRL problems, the action-value function (AVF) expresses the anticipated return $G_t$ after the action $a_t$ is applied in the state $s_t$:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$$
Therefore, the main purpose of DRL is to estimate the AVF $Q^{\pi}(s_t, a_t)$ and derive the appropriate policy $\pi$ accordingly.
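The discounted return above can be illustrated with a few lines of Python:

```python
# Discounted return G_t = sum_{k>=0} gamma^k * r_{t+k}, accumulated backwards.
def discounted_return(rewards, gamma):
    G = 0.0
    for r in reversed(rewards):   # Horner-style accumulation from the last reward
        G = r + gamma * G
    return G

G = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
```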
Many RL methods use the Bellman equation to estimate the AVF:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[r_t + \gamma\,\mathbb{E}\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]\right]$$
In the following, the DDPG algorithm is used to design the DRL agents. Figure 4 shows the execution process of DDPG, which consists of two networks, the actor and the critic. The actor network adjusts the weights $\theta^{\mu}$ of the policy $\mu(s \mid \theta^{\mu})$ mapping the observation (state) to the corresponding action, and the critic network adjusts the weights $\theta^{Q}$ of the action-value function $Q(s, a \mid \theta^{Q})$. The critic weights are updated by minimizing the loss function
$$L(\theta^{Q}) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t \mid \theta^{Q})\right)^2\right]$$
where $y_t = r_t + \gamma\,Q'\!\left(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$ is the target value computed with the target networks $Q'$ and $\mu'$.
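The critic target and squared TD-error loss can be traced with toy scalar stand-ins (linear functions playing the role of the deep actor and critic; all numbers are made up):

```python
# Toy scalar stand-ins for the critic Q(s, a | w) and actor mu(s | v),
# used only to trace the DDPG target and loss computation.
gamma = 0.99

def critic(s, a, w):
    return w[0] * s + w[1] * a

def actor(s, v):
    return v * s

w = (0.5, 0.2)          # critic weights theta_Q
w_targ = (0.5, 0.2)     # target critic weights theta_Q'
v_targ = 0.3            # target actor weights theta_mu'

# One transition (s, a, r, s') drawn from the replay buffer
s, a, r, s2 = 1.0, 0.4, -0.1, 0.9

y = r + gamma * critic(s2, actor(s2, v_targ), w_targ)   # Bellman target y_t
loss = (y - critic(s, a, w)) ** 2                       # squared TD error
```

In the real algorithm `loss` is averaged over a minibatch and minimized by gradient descent on the critic weights.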

Furthermore, the actor coefficients $\theta^{\mu}$ are updated via the sampled policy gradient:
$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s \sim \rho^{\beta}}\!\left[\nabla_a Q(s, a \mid \theta^{Q})\big|_{a=\mu(s)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\right]$$
where $\rho^{\beta}$ is the discounted state distribution under a behavior policy $\beta$ distinct from the current policy $\pi$. In addition, an exploration noise $W$ is added to the actor actions (i.e., $a_t = \mu(s_t \mid \theta^{\mu}) + W$) to improve the training process [30]. The actor and critic networks designed in this paper each consist of an input layer, an output layer, and three hidden layers with 80, 80, and 30 neurons, as shown in Figure 4. The input to the actor network is the state $e_i$, and its outputs are $\lambda_v$, $\lambda_{der}$, and $\lambda_{sw}$. The developed control system aims to minimize the output-voltage error in the shortest possible time so as to stabilize the DC-MG; hence, the reward function is defined in terms of the tracking error. Based on the reward signal, the weights of the actor and critic networks are trained such that the error between the reference voltage $V_{ref}$ and the average output voltage of the DGs is minimized. A flowchart of the proposed multi-agent DRL-based design of the weighting coefficients in the FCS-MPC is shown in Figure 5, and the DDPG design process for each agent is given in Algorithm 1.
Algorithm 1: DDPG-based design of the weighting coefficients (per agent)
1: Randomly initialize the critic $Q(s, a \mid \theta^{Q})$ and actor $\mu(s \mid \theta^{\mu})$ networks
2: Initialize the target networks with $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
3: for episode = 1 to M do
4: Start an Ornstein-Uhlenbeck (OU) noise process for exploration and get the initial observation state $s_1$
5: for t = 1 to T do
6: Select the action $a_t = \mu(s_t \mid \theta^{\mu}) + W$
7: Apply action $a_t$ to the environment and observe $e$ as the next state $s_{t+1}$
8: Calculate the reward $r_t$ according to Equation (23) from the difference between the simulated and observed behavior
9: Save $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $F$
10: Sample a random minibatch of $m$ transitions from $F$
11: Update the critic by minimizing the loss (19)
12: Update the actor policy based on the sampled policy gradient (21)
13: Update the target networks
14: end for
15: end for
16: end for
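The experience storage and minibatch sampling of Algorithm 1 can be sketched as follows; the buffer capacity and dummy transitions are illustrative:

```python
import random
from collections import deque

# Replay-buffer mechanics of Algorithm 1: store transitions (s, a, r, s'),
# then draw a random minibatch for the critic/actor updates.
buffer = deque(maxlen=10_000)            # capacity is an illustrative choice

def store(s, a, r, s_next):
    buffer.append((s, a, r, s_next))

def sample_minibatch(m):
    return random.sample(list(buffer), m)

for k in range(100):                     # fill with dummy transitions
    store(float(k), 0.0, -abs(k - 50.0), float(k + 1))
batch = sample_minibatch(8)
```

Sampling uniformly from a large buffer breaks the temporal correlation between consecutive transitions, which stabilizes the network updates.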

Simulation Results
Various case studies are conducted to assess the performance of the proposed control scheme. The DC-MG shown in Figure 1, including three DGs interfaced with DC-DC buck converters, is simulated in the MATLAB/Simulink environment. The specifications of the studied network are given in Table 2, and the design parameters of the DDPG algorithm in Table 3. The proposed scheme is tested in the following scenarios:
• Unknown load dynamics
• Variation of input voltage
• PnP operation
• Variation of reference voltage
Table 2. Parameters of the test system.

Study 1: Unknown Load Dynamics
This study demonstrates the robustness of the proposed control scheme against sudden load changes. For this purpose, the load power declines to half of its initial value at t = 0.2 s and returns to the previous value at t = 0.3 s. The deviations in the voltage and current of the load are shown in Figures 6 and 7, respectively. As shown in Figure 6, the load voltage remains constant as the load power changes, indicating the proper performance of the proposed scheme. The output voltage of each DG is illustrated in Figure 8; as the figure presents, the overshoots are minor, denoting the effectiveness of the proposed control scheme. The weighting coefficients generated by the DRL-based approach for each DG under the applied load changes are presented in Figures 9-11. The figures reveal that the DRL-based method regulates the weighting coefficients such that stable operation of the DC-MG is achieved.

Study 2: Input Voltage Variations
This study examines the robustness of the proposed control scheme under input voltage changes. A step increase of 10 V at t = 1 s is applied to the input DC voltage of the buck converters. The load voltage and current are shown in Figures 12 and 13, respectively. As the figures indicate, the proposed scheme suppresses the offset induced by the input voltage deviation in a short time. The deviations in the output voltages of the DGs are shown in Figure 14. As shown, with the proposed scheme the voltage fluctuation of DG1 is less than 1%, which is quite satisfactory; a similar observation holds for the other DGs. It can also be concluded from the figures that the current varies more strongly while the voltage returns to the reference value. The weighting coefficients generated by the DRL method in response to the input voltage variation are presented in Figures 15-17.

Study 3: PnP Operation
Here, the PnP capability of the proposed scheme is examined. To this end, it is assumed that DG3 is disconnected from the MG at t = 0.3 s and reconnected at t = 0.4 s. The load voltage and current are displayed in Figures 18 and 19. As can be seen, the steady-state voltage variations are less than 0.02 V, and the control goals are achieved with a quick reaction time. Also, the variation of the load current remains almost equal to the nominal value of the system. Figure 20 shows the output voltage of each DG unit. As seen, at t = 0.3 s, when DG3 is plugged out, its voltage drops by about 0.05 V, and the voltages of DG1 and DG2 increase by 0.01 V, remaining almost equal to the reference voltage. The advantage is that each unit is controlled separately, which is not possible with centralized control. The weighting coefficients generated by the DRL are illustrated in Figures 21-23; the DRL generates them online and dynamically during the transient and after the disconnection of DG3 from the DC-MG.

Study 4: Variation of Reference Voltage
Changes in the voltage reference may be required to adjust the current sharing between the DG units or to control the state of charge of batteries embedded in the islanded MG. In this study, the performance of the proposed strategy is evaluated under variation of the reference voltage. To this end, at t = 0.3 s the reference voltage is reduced to 180 V, and at t = 0.6 s it returns to the initial value of 200 V; that is, the reference voltage changes by 10%. A comparison is also made between the weighting coefficients tuned by the multi-agent DRL approach and those tuned by trial and error. Figures 24 and 25 illustrate the voltage and current of the load. As shown, the voltage reaches the reference value in a short time without any ripple, overshoot, or undershoot. The current behaves similarly, changing in such a way that constant power is delivered. Figure 26 shows the output voltage of each DG. The variations of the FCS-MPC weighting coefficients for DG1, DG2, and DG3 are given in Figures 27-29.
The figures indicate that the DRL-based scheme regulates the weighting coefficients such that the least fluctuation is achieved in the dynamic responses. Another study is conducted under reference voltage changes with fixed weighting coefficients: $\lambda_v$, $\lambda_{der}$, and $\lambda_{sw}$ are set to 4.9, 4.65, and 5, respectively. The load voltage and current in this case are shown in Figures 30 and 31, respectively. As can be observed, although the control system with fixed coefficients keeps the voltage constant at 200 V, when the reference voltage changes it cannot follow the new value correctly, leaving an error of 2.8 V.

Conclusions
This study proposed a real-time solution employing a multi-agent DRL algorithm to design the weighting coefficients appearing in the FCS-MPC used for buck converters interfaced with CPLs in a DC-MG. A DDPG method is employed to learn the optimal weighting-coefficient design policy. Minimizing the voltage and current deviations of each DG was the main objective of the DRL-based FCS-MPC method. The key features of the proposed method are its online learning capacity, minimal computational complexity, and freedom from prior knowledge of the MG dynamics. Finally, the simulation results obtained from a benchmark DC-MG with three DGs demonstrated the effectiveness of the proposed solution under different operating conditions. For example, it was shown that, using the proposed scheme, the voltage fluctuation of DG1 under input voltage variations is less than 1%, which is quite satisfactory. The results confirmed that the proposed control scheme: (1) has superior performance in comparison with FCS-MPC with fixed weighting coefficients; (2) exhibits robust performance against uncertainties such as input and reference voltage variations; (3) deals with power changes in non-ideal CPLs; (4) presents PnP capability; and (5) avoids the dependency of the converter control system on the operating-point conditions, thereby supporting a wide range of operating conditions. Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.