Article

Design of Robust Adaptive Nonlinear Backstepping Controller Enhanced by Deep Deterministic Policy Gradient Algorithm for Efficient Power Converter Regulation

School of Engineering, Edith Cowan University, Joondalup 6027, Australia
* Author to whom correspondence should be addressed.
Energies 2025, 18(18), 4941; https://doi.org/10.3390/en18184941
Submission received: 7 July 2025 / Revised: 27 August 2025 / Accepted: 16 September 2025 / Published: 17 September 2025
(This article belongs to the Special Issue Power Electronics for Smart Grids: Present and Future Perspectives II)

Abstract

Power converters play an important role in incorporating renewable energy sources into power systems. Among different converter designs, Buck and Boost converters are popular, as they use fewer components and deliver cost savings and high efficiency. However, Boost converters are non-minimum phase systems, which imposes harder constraints on the design of a robust controller. Developing an efficient controller for these topologies can be difficult, since they exhibit nonlinearity and distortion in high-frequency modes. The Lyapunov-based Adaptive Backstepping Control (ABSC) technique is used to regulate the outputs of these structures. This approach extends the conventional backstepping method with a Lyapunov stability function, providing increased stability and resistance to fluctuations in real-world circumstances. However, in real-time situations, disturbances with larger ranges, such as supply voltage changes, parameter variations, and noise, may have a negative impact on the operation of this strategy. To increase the controller’s flexibility under more difficult working conditions, the most appropriate initial gains must be established. To address these concerns, the ABSC’s performance is optimized using an adaptive Reinforcement Learning (RL) technique. RL has several advantages, including lower susceptibility to error, more reliable results obtained from data gathered from the environment, accurate behavior within a given operating context, and better frequency matching in real-time applications. Random exploration, on the other hand, can have disastrous effects and produce unexpected results in real-world situations. As a result, we choose the Deep Deterministic Policy Gradient (DDPG) approach, which uses a deterministic action function rather than a stochastic one. Its key advantages include effective handling of continuous action spaces, improved sample efficiency through off-policy learning, and faster convergence via its actor–critic architecture, which balances value estimation and policy optimization. Furthermore, this technique uses the Grey Wolf Optimization (GWO) algorithm to improve the initial set of gains, resulting in more reliable outcomes and quicker dynamics. The GWO technique is notable for its disciplined, nature-inspired approach, which leads to faster decision-making and greater accuracy than other optimization methods. This method treats the system as a black box, without requiring its exact mathematical model, leading to lower complexity and computational burden. The effectiveness of this strategy is tested in both simulation and experimental scenarios using a Hardware-In-the-Loop (HIL) framework, with considerable results and decreased error sensitivity.

1. Introduction

DC/DC power converters have been crucial in industrial applications due to their ability to control output voltage with the fewest possible components. The Boost and Buck converter topologies chosen here operate in step-up and step-down modes, respectively. Nevertheless, the Boost converter’s zero in the right half-plane causes a non-minimum phase condition [1,2]. In high-frequency operation, discontinuous results and strong non-linearity are documented for these structures, despite their many benefits. This issue leads to tracking control problems by causing the feedback controller to deviate from the reference signal. Additionally, the system’s states may suffer as a result of the internal stability issue. Different kinds of controllers with distinct features have been developed in order to provide soft switching and smooth operation [3,4]. In the domain of power electronics, particularly in DC–DC converter applications such as Buck and Boost converters, the inherent nonlinearity of the system dynamics, arising from switch-mode operation, time-varying loads, and input voltage fluctuations, demands the deployment of nonlinear control strategies to ensure robust performance under diverse operating conditions.
In contrast to linear control methods, which usually cannot ensure stability and tracking accuracy in the presence of parameter variations and external disturbances, nonlinear controllers can provide a superior dynamic response, superior voltage regulation, and disturbance rejection. Consequently, nonlinear control techniques are becoming more important because of their capacity to guarantee robust voltage regulation, fast transient response, and stability in the face of parameter uncertainties [5]. Sliding Mode Control (SMC) is one such technique, and it is very robust; however, it is prone to chattering, which can damage switching components [6,7]. Model Reference Adaptive Control (MRAC) is a parameter-tracking controller that can converge slowly or be sensitive to persistent excitation [8,9]. Feedback linearization is precise but highly sensitive to modeling accuracy [10,11], whereas Passivity-Based Control (PBC) is theoretically robust but lacks flexibility and ease of tuning [12,13,14]. Neural networks and fuzzy logic controllers provide intelligent behavior; however, they require significant computational effort or do not scale well [15,16,17,18,19]. Backstepping Control (BSC) has attracted attention because it is a recursive Lyapunov-based control design method that can ensure global asymptotic stability and robustness to matched disturbances [20,21]. Recent works have also applied BSC to power converters with good results in voltage tracking and robustness. To further increase dynamic performance, adaptive BSC frameworks have been presented, where subsystems are controlled using nonlinear control laws and adaptive tuning rules that allow real-time response to parameter changes [22]. These adaptive techniques have greatly enhanced transient response and reduced steady-state error under variable operating conditions. There are, however, two major limitations. First, adaptive BSC is very sensitive to the design of an efficient online adaptation mechanism, which can be computationally demanding and vulnerable to noise. Second, the controller must use gain values that are initially well tuned, which are typically obtained through trial and error or by empirical estimation and therefore influence the convergence speed and overall performance.
The last few years have seen the introduction of Artificial Intelligence (AI) and Machine Learning (ML) approaches, especially Reinforcement Learning (RL), into the controllers of power converters, offering the prospect of online optimization, real-time adaptation, and better dynamic response under uncertain circumstances [23,24]. Conventional control techniques have proved to be effective in controlling nonlinear systems; however, they are commonly fixed-gain or precisely model-based and lack flexibility. To overcome these limitations, there has been recent interest in integrating RL with classical approaches. As an example, the combination of DDPG with a high-order SMC observer has demonstrated improved voltage regulation of DC–DC Boost converters, with reduced chattering and steady-state error [25]. In a similar manner, RL-aided MPC frameworks have been introduced, where RL agents adapt the weighting factors in Finite-Set MPC, decreasing switching frequency and enhancing harmonic performance in power converters [26]. In another work, fuzzy logic-based Q-learning controllers were designed for VSC-HVDC systems, which provide adaptive reference tracking and lessen reliance on the empirical tuning of PI gains [27]. Reinforcement learning was also successfully used in CLLLC converters to optimize performance over a wide range of operating points, thus considerably easing the design task [28]. Furthermore, the authors have paid attention to safety-aware RL implementations, where online agents optimize the control of the converter without unstable learning-induced behavior [29]. Various publications show that it is also possible to use RL directly to regulate voltages and currents. Specifically, deep Q-networks (DQNs) and DDPG agents have been used to control Buck and Boost converters to track voltage, giving better transient response and flexibility compared to a fixed-gain controller [12]. Another model-free DRL approach was developed to stabilize the output voltage of a DC microgrid when using a Buck converter, and it demonstrated dependable performance without modeling the system in detail [30]. Likewise, discrete-time RL-based controllers have been demonstrated to be superior to conventional PID in efficiency and robustness when applied to a practical converter system [31]. The application of RL to nonlinear Boost converter control by Zandi has shown that RL can autonomously learn the optimum duty cycles when the load and voltage vary, outperforming manually tuned gains [32]. Combined, these studies highlight that RL can be used both as a controller on its own and as a versatile adaptive layer on top of traditional control architectures, improving their resilience and reactivity in nonlinear and time-varying environments. Some of the notable RL algorithms used include Twin-Delayed DDPG (TD3), Proximal Policy Optimization (PPO), Dueling DQN, and DDPG [33,34,35,36]. Actor–critic algorithms, such as DDPG and TD3, have the advantage of being able to handle continuous control problems and converge better than DQN and Dueling DQN, which makes them suitable for power electronics [34]. In particular, the DDPG algorithm is well suited to continuous-action spaces such as power converters, as it is off-policy and sample efficient [37]. As such, this paper suggests the adoption of a DDPG-based RL algorithm into our Adaptive Backstepping Controller (ABSC) framework in order to leverage the Lyapunov-based stability of the controller and the adaptive nature of DDPG.
Nonetheless, an outstanding issue has yet to be addressed: the system’s performance depends on the initially tuned controller gains. Metaheuristic Algorithms (MAs) can be used to address this concern in the optimization process.
MAs have shown significant outcomes in handling complex models, such as high-order equations or unpredictable variations [38,39]. They are also beneficial in balancing exploitation and exploration across both new and existing search areas [40]. Recently, the utilization of metaheuristic algorithms for tuning controllers in power converters and motor drives has been gaining popularity. MAs, such as Particle Swarm Optimization (PSO), the Ant Lion Optimizer (ALO), and Genetic Algorithms (GAs), can enhance a controller’s overall performance by optimizing parameters that are not effectively adjusted via traditional methods [41,42,43,44]. This integration has led to improvements in system response, including enhanced transient response, increased robustness to disturbances, and reduced overshoot. MAs offer flexibility and global search methodologies, enabling the effective optimization of complex, nonlinear, and multi-modal optimization problems. This makes them suitable for controller tuning in power converters and motor drives. The GWO algorithm was developed as a flexible metaheuristic technique inspired by the leadership hierarchy and hunting behavior of grey wolves [45]. The GWO algorithm organizes individuals into groups whose coordinated movements explore and efficiently exploit the search space. It has outperformed traditional optimization methods for complex problems by first searching or exploring the space and then exploiting solutions that improve towards the best solution [46]. Recently published research has shown that the GWO algorithm can be applied to several optimization problems with strong performance, notably the optimization of control parameters for quantum processors, solving high-dimensional, non-convex problems with state-of-the-art results. In the renewable energy domain, an enhanced GWO algorithm has been utilized for maximum power point tracking of photovoltaic systems, providing improved convergence speed, accuracy, and greater stability during partial shading conditions [47,48]. The advantages of this approach are improved optimization efficiency, more reliable solutions, and the ability to optimize in complex high-dimensional search spaces. It is essential to keep in mind that the GWO algorithm’s management of the balance between exploration and exploitation makes it a practical optimization algorithm for designing controllers of power converters and motor drives optimally, with the potential to improve system performance and robustness.
The novelty of this work lies in presenting a robust adaptive backstepping controller enhanced by Lyapunov-based stability, integrating the RL-DDPG algorithm to optimize its control parameters online, leading to better disturbance-rejection behavior and higher adaptation to the working environment. It also benefits from the GWO algorithm to find the best-tuned initial gains for the controller block, leading to faster convergence and lower sensitivity to errors. This work is designed based on a black box strategy to reduce dependency on system modeling, resulting in a lower complexity burden and ease of implementation in real-time applications. The primary contributions of this approach are as follows:
  • An adaptive backstepping control approach is designed for Boost and Buck converters, whose dynamics are improved using the Lyapunov theorem. This approach can guarantee accurate tracking by reducing susceptibility to error.
  • The ABSC approach is enhanced with the RL-DDPG method to optimize control parameters online to reach better disturbance-rejection behavior and higher adaptability to the working environment.
  • The GWO algorithm is adopted as a metaheuristic optimization stage to initially optimize the parameters of the system for faster convergence in the controller block, which leads to better adaptation to the system frequency behavior.
  • This is a model-free controller that lowers the dependency on exact mathematical modeling of the system, providing faster dynamics and better disturbance-rejection behavior.
  • The use of HIL setups for real-time testing provides a rigorous validation of the system’s performance in both simulation and experimental environments, enhancing the credibility and practical relevance of the results.

2. Mathematical Modeling of the Converters

2.1. Boost Converter

This is a non-isolated step-up power converter widely used in energy conversion systems. It operates by switching a semiconductor device (usually a MOSFET) in two states, controlled through a Pulse Width Modulation (PWM) signal. The basic configuration of this topology is shown in Figure 1a. Looking at the two switching functions in Figure 1b,c, we define the averaged state-space model of the converter [1].

2.2. State Variables and Input Definition

We can define the state vector, input, and output as follows [2]:
$x(t) = \begin{bmatrix} i_L(t) \\ V_C(t) \end{bmatrix}, \qquad u(t) = E(t)\ (\text{input voltage}), \qquad y(t) = V_C(t)\ (\text{output voltage}).$
The Boost converter dynamics can be modeled in the standard state-space form:
$\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)$

2.3. Circuit Equations

2.3.1. Switch ON (Q Closed)

During the ON state (switch is conducting), the inductor is charged and the diode is reverse biased. The circuit equations are as follows:
$\frac{di_L}{dt} = \frac{E}{L}$
$\frac{dV_C}{dt} = -\frac{V_C}{RC}$

2.3.2. Switch OFF (Q Open)

During the OFF state (switch is open), the inductor discharges through the diode to the output:
$\frac{di_L}{dt} = \frac{E - V_C}{L}$
$\frac{dV_C}{dt} = \frac{i_L}{C} - \frac{V_C}{RC}$

2.4. Averaged State-Space Model

By averaging the dynamics over a switching cycle with duty ratio D, we obtain the averaged state-space equations [1]:
$\dot{x}(t) = \begin{bmatrix} \dot{i}_L \\ \dot{V}_C \end{bmatrix} = A\,x(t) + B\,u(t)$
where the matrices A and B are as follows:
$A = D \begin{bmatrix} 0 & 0 \\ 0 & -\frac{1}{RC} \end{bmatrix} + (1-D) \begin{bmatrix} 0 & -\frac{1}{L} \\ \frac{1}{C} & -\frac{1}{RC} \end{bmatrix}$
$B = \begin{bmatrix} \frac{1}{L} \\ 0 \end{bmatrix}$
$C = \begin{bmatrix} 0 & 1 \end{bmatrix}$
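For illustration, the averaged model above can be simulated directly. The following Python sketch uses placeholder component values (the actual values used in this work are listed in Table 1) and checks the result against the ideal Boost conversion ratio E/(1-D):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Placeholder component values for illustration only; see Table 1 for the actual parameters.
L, C, R, E, D = 1e-3, 470e-6, 50.0, 50.0, 0.5

def boost_avg(t, x):
    """Averaged CCM Boost dynamics from Section 2.4: x = [iL, vC]."""
    iL, vC = x
    diL = (E - (1.0 - D) * vC) / L          # D*(E/L) + (1-D)*((E - vC)/L)
    dvC = ((1.0 - D) * iL - vC / R) / C     # D*(-vC/(RC)) + (1-D)*(iL/C - vC/(RC))
    return [diL, dvC]

sol = solve_ivp(boost_avg, (0.0, 0.3), [0.0, 0.0], max_step=1e-4)
print(f"Output voltage ~ {sol.y[1, -1]:.1f} V (ideal E/(1-D) = {E / (1 - D):.1f} V)")
```

This confirms that the averaged model reproduces the expected DC operating point on which the later second-order, black box assumption is based.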

2.5. Buck Converter

The Buck converter is a step-down DC–DC converter that reduces the input voltage to a lower output level. It operates by rapidly switching a transistor to control energy transfer to the load. The basic structure is shown in Figure 2. Based on the two switching functions in Figure 2b,c, one can define the averaged state-space model of the converter.

2.6. State Variables and Input Definition

We define the system as a two-state system with:
$x(t) = \begin{bmatrix} i_L(t) \\ V_C(t) \end{bmatrix}\ (\text{state vector}), \qquad u(t) = E(t)\ (\text{input voltage}), \qquad y(t) = V_C(t)\ (\text{output voltage})$

2.7. Mode 1: Switch ON (Q Closed)

The general state-space representation is given by Equation (1). When the switch is ON:
  • Diode is reverse-biased.
  • Inductor is connected directly to the source.
  • Capacitor supplies the load and is charged.
The governing equations are as follows:
$\frac{di_L}{dt} = \frac{E - V_C}{L}$
$\frac{dV_C}{dt} = \frac{i_L}{C} - \frac{V_C}{RC}$

2.8. Mode 2: Switch OFF (Q Open)

When the switch is OFF:
  • Diode conducts.
  • Inductor current freewheels through the diode.
The governing equations are as follows:
$\frac{di_L}{dt} = -\frac{V_C}{L}$
$\frac{dV_C}{dt} = \frac{i_L}{C} - \frac{V_C}{RC}$

2.9. Averaged State-Space Model

Using the duty cycle $D \in [0, 1]$, the averaged state-space model in the continuous conduction mode (CCM) is as follows:
$\dot{x}(t) = \begin{bmatrix} \dot{i}_L \\ \dot{V}_C \end{bmatrix} = A\,x(t) + B\,u(t)$
where the matrices are given as follows:
$A = \begin{bmatrix} 0 & -\frac{1}{L} \\ \frac{1}{C} & -\frac{1}{RC} \end{bmatrix}, \qquad B = \begin{bmatrix} \frac{D}{L} \\ 0 \end{bmatrix}$
$C = \begin{bmatrix} 0 & 1 \end{bmatrix}$
Based on the state-space matrices defined for both converters, we can assume the following second-order transfer function for further analysis:
$G(s) = \frac{k_1}{s^2 + k_2 s + k_3}$
Equation (17) expresses the simplified second-order transfer function of the Buck and Boost converters, where the dominant dynamics are governed by the inductor current and output voltage. This form is used to define the system order for the backstepping design, without relying on precise converter parameters. The assumption of a second-order system is justified by the averaged continuous-conduction-mode model of Buck and Boost converters, where higher-order effects and parasitics are neglected. This allows the proposed controller to be designed in a black box manner, while the RL agent adapts the gains to compensate for uncertainties and unmodeled dynamics. It should be noted that the state-space equations are only used to establish the system order (second order for Buck and Boost converters), whereas the DDPG agent operates in a black box manner, learning control gains directly from data without requiring explicit converter parameters. The proposed controller is a model-free controller that can regulate both converters without an exact mathematical model of the systems. The values of the parameters used in both converters are listed in Table 1.

3. Controllers

Adaptive Backstepping Technique

Designing a state-augmented adaptive backstepping controller requires error equations and an additional condition, based on error integration, that is appended to the system’s equations. In addition, the system’s terms are broken down into subsystems, and the Lyapunov function is used to create a virtual control signal for each subsystem that keeps it stable. In the final step of the backstepping design, all of the system equations are considered, and the control output is constructed so that the Lyapunov function can guarantee system stability [15].
The goal of this stage is to use the Lyapunov function to stabilize these subsystems. The control signal is created so that all of the BSC design equations are satisfied while system stability is guaranteed by the Lyapunov function [15]. Equation (18) is the error equation:
$e(t) = V_r(t) - V_o(t)$
In (18), the error is the difference between the reference voltage ($V_r$) and the output voltage ($V_o$). In (19), we introduce three elements: a weighted integral of the error, the error itself, and a one-step-shifted error term:
$x_1(t) = n \int_0^t e(\tau)\, d\tau, \qquad x_2(t) = e(t), \qquad x_3(t) = e(t+1).$
To design the control signal for the proposed controller, the discrete transfer function of the system is needed, which is defined in Equation (20) using the Zero-Order Hold (ZOH) method:
$\frac{V_o}{u} = \frac{P_3 z - P_4}{z^2 - P_1 z + P_2}$
In Equation (20), $P_1$–$P_4$ are the coefficients of the system.
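As an illustration of how the discrete coefficients $P_1$–$P_4$ can be obtained, the sketch below discretizes the averaged Buck model of Section 2.9 with a Zero-Order Hold using SciPy. The component values and the 100 µs sampling period are placeholders; the adaptive controller itself treats these coefficients as estimates that are updated online.

```python
import numpy as np
from scipy.signal import cont2discrete, ss2tf

# Placeholder parameters (the real ones are in Table 1); averaged Buck model, Section 2.9.
Lb, Cb, Rb, D = 1e-3, 470e-6, 50.0, 0.5
A = np.array([[0.0, -1.0 / Lb], [1.0 / Cb, -1.0 / (Rb * Cb)]])
B = np.array([[D / Lb], [0.0]])
C = np.array([[0.0, 1.0]])
Dm = np.array([[0.0]])

Ts = 100e-6                                     # sampling period used by the digital controller
Ad, Bd, Cd, Dd, _ = cont2discrete((A, B, C, Dm), Ts, method='zoh')
num, den = ss2tf(Ad, Bd, Cd, Dd)
# den = [1, a1, a2] and num = [0, b1, b2] map to the P1..P4 coefficients of Eq. (20),
# up to the sign convention used there.
print("numerator:", num.ravel(), "denominator:", den)
```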
The simplification process for the following steps is described by the author in [20] in the time domain. Next, the updated state relations can be introduced:
$x_1(t+1) = n\,x_2(t), \qquad x_2(t+1) = x_3(t), \qquad x_3(t+1) = P_3\,u(t+1) + r^{T} y.$
Consequently, using the process described in [20], the final control law can be defined in Equation (22).
$u(t+1) = \frac{1}{P_3}\Big[ V_r(t+2) - P_1 V_o(t+1) + P_2 V_o(t) + P_4 u(t+2) + k_1 x_3(t) + k_2 x_2(t) + n x_2(t) + k_3 x_3(t) + k_2 n x_1(t) + k_1 x_2(t) \Big].$
In Equation (22), the gains of the controller are denoted by $k_1$, $k_2$, and $k_3$. The control law is modified to reflect the adaptation mechanism as follows:
$u(t+1) = \frac{1}{\hat{P}_3}\Big( V_r(t+2) - \hat{P}_1 V_o(t+1) + \hat{P}_2 V_o(t) + \hat{P}_4 u(t+2) + k_1 x_3(t) + k_2 x_2(t) + n x_2(t) + k_3 x_3(t) + k_2 n x_1(t) + k_1 x_2(t) \Big).$
Moreover, by replacing the control law in the system equations, the error dynamics equations can be formed as follows (24):
$X(t+1) = A\,X(t) + B\,u,$
$\begin{bmatrix} x_1(t+1) \\ x_2(t+1) \\ x_3(t+1) \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -(k_3 + k_2 n) & -(k_2 n + k_3 + k_1) & -(k_1 + k_3) \end{bmatrix} \begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ \frac{1}{\hat{P}_3} \end{bmatrix} \tilde{s}^{T} y$
where
$\tilde{s}^{T} = \frac{1}{\hat{P}_3} \begin{bmatrix} \hat{P}_1 - \bar{P}_1 & \hat{P}_2 - \bar{P}_2 & \hat{P}_3 - \bar{P}_3 & \hat{P}_4 - \bar{P}_4 \end{bmatrix}, \qquad y = \begin{bmatrix} V_o(t+2) \\ V_o(t+1) \\ V_o(t) \\ u(t+2) \end{bmatrix}.$
The selected Lyapunov function for analyzing stability is:
$v(t) = \frac{1}{2} X^{T}(t) P X(t) + \frac{1}{2\gamma} \tilde{s}^{T} \tilde{s}$
where P is a symmetrical positive definite matrix. Furthermore, differentiating (25) results in the following:
$v(t+1) = \frac{1}{2} X^{T}(t+1) P X(t) + \frac{1}{2} X^{T}(t) P X(t+1) + \frac{1}{\gamma} \tilde{s}^{T} \tilde{s}.$
The third term in Equation (26) originates from the derivative of the cross-term $\frac{1}{2\gamma}\tilde{s}^{T}\tilde{s}$ in Equation (25). In the discrete-time expansion, this produces the additional $\frac{1}{\gamma}\tilde{s}^{T}\tilde{s}$ contribution. This ensures that the Lyapunov function properly accounts for the adaptive signal energy across successive time steps. Consequently, the following equations are presented for the Lyapunov function and the parameters of the control law:
$\dot{\hat{s}} = -\gamma X^{T}(t) P B \hat{P}_3 y, \qquad v(t+1) = -\frac{1}{2} X^{T}(t) Q X(t), \qquad v(t+2) = -X^{T}(t+1) Q X(t).$
The process of designing the controller is shown in [20] using a mathematical description and the role of other adopted methods in developing this structure. To optimize the gains of the controller ( k 1 , k 2 , k 3 ), we have combined the controller with RL to achieve better results for the controller, which is described in the following sections.
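To make the recursion above concrete, the following Python sketch shows how a discrete backstepping law of the form of Equation (23) could be evaluated at each sampling instant. The grouping of gain terms, the index shifts, and the signs are illustrative assumptions based on the reconstruction above (the exact derivation is given in [20]); `P_hat` contains the estimated plant coefficients of Equation (20) and `n` is the integral weighting of Equation (19).

```python
class ABSCSketch:
    """Structural sketch of a discrete adaptive backstepping law in the spirit of Eq. (23).

    The index shifts, sign conventions, and gain grouping are illustrative assumptions;
    the exact derivation is given in [20]. P_hat holds the estimated plant coefficients
    P1..P4 of Eq. (20), k the gains (k1, k2, k3), and n the integral weighting of Eq. (19).
    """

    def __init__(self, k, P_hat, n=0.1, Ts=100e-6):
        self.k1, self.k2, self.k3 = k
        self.P1, self.P2, self.P3, self.P4 = P_hat
        self.n, self.Ts = n, Ts
        self.x1 = 0.0        # weighted integral of the tracking error
        self.e_prev = 0.0    # previous error sample
        self.u_prev = 0.0    # previous control input
        self.vo_prev = 0.0   # previous output-voltage sample

    def step(self, v_ref, v_out):
        e = v_ref - v_out                      # tracking error, Eq. (18)
        self.x1 += self.n * e * self.Ts        # weighted error integral, Eq. (19)
        x2, x3 = self.e_prev, e                # shifted error samples
        u = (1.0 / self.P3) * (
            v_ref
            - self.P1 * v_out
            + self.P2 * self.vo_prev
            + self.P4 * self.u_prev
            + (self.k1 + self.k3) * x3
            + (self.k1 + self.k2 + self.n) * x2
            + self.k2 * self.n * self.x1)
        self.e_prev, self.u_prev, self.vo_prev = e, u, v_out
        return u
```

In the complete scheme, the DDPG agent overwrites (k1, k2, k3) at every step, while the coefficient estimates are updated by the adaptation law of Equation (27).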

4. RL-Based Optimization of ABSC Gains

To enhance both performance and adaptability of the ABSC approach, a model-free reinforcement learning strategy is employed. Specifically, the DDPG algorithm is used to optimize three essential control gains, k 1 , k 2 , and k 3 , in real time [12].

4.1. Problem Setup

The controller aims to generate an optimal control law using adaptive gains:
$\mathbf{k}(t) = \begin{bmatrix} k_1(t) & k_2(t) & k_3(t) \end{bmatrix}^{T}$
This task is cast as a continuous control problem modeled using a Markov Decision Process (MDP) with:
  • State:
    $s(t) = \left[\, e(t), \ \dot{e}(t), \ \int e(t)\, dt, \ V_o(t) \,\right]$
  • Action:
    $a(t) = \mathbf{k}(t)$
  • Reward: A function penalizing poor tracking and excessive control effort.
The reward function is defined as follows:
$r(t) = -\left( \alpha_1 e^{2}(t) + \alpha_2 u^{2}(t) + \alpha_3 \left( \frac{dV_o(t)}{dt} \right)^{2} \right)$
The coefficients α 1 , α 2 , and α 3 in the reward function were selected to balance tracking accuracy, control effort, and output smoothness. A sensitivity analysis was conducted to determine the most suitable values for these parameters. Specifically, α 1 was assigned the highest weight to prioritize precise voltage tracking, α 2 was chosen as smaller to limit excessive switching activity, and α 3 was selected to penalize abrupt variations in the output voltage, thereby enhancing noise robustness. Based on simulation experiments, the final values were set as α 1 = 10, α 2 = 0.5, and α 3 = 1, which provided the most favorable balance between fast convergence, low steady-state error, and robustness in both simulation and HIL validation.
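As a minimal illustration, the reward with these weights can be written as follows; in practice the derivative of the output voltage would be approximated from successive samples:

```python
# Reward weights reported above; assumed to be evaluated at every sampling instant.
ALPHA1, ALPHA2, ALPHA3 = 10.0, 0.5, 1.0

def reward(e, u, dvo_dt):
    """Negative weighted penalty on tracking error, control effort, and output slew rate."""
    return -(ALPHA1 * e**2 + ALPHA2 * u**2 + ALPHA3 * dvo_dt**2)
```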

4.2. DDPG Agent Architecture

DDPG is an actor–critic DRL algorithm intended to work with continuous action spaces. It uses the advantages of deterministic policy gradients, and the policy (actor) can output continuous-valued actions directly with deep function approximation. As an actor–critic model, DDPG has two neural networks: an actor network, which takes states as inputs and outputs actions (e.g., controller gains), and a critic network, which takes pairs of states and actions as inputs and outputs the Q-value of the state–action pair to guide learning. A key benefit of DDPG is that it can learn optimal control policies in real time over nonlinear and high-dimensional systems without the need for an explicit mathematical model of the environment. Also, DDPG achieves more stable and convergent training with the use of experience replay and target networks. These properties render DDPG especially suitable in control applications, including tuning gains in ABSC, where continuous, adaptive learning of the policy is necessary in a dynamic context [37]. The DDPG agent includes the following components (a minimal network sketch is given after this list):
  • An actor network $\mu(s \mid \theta^{\mu})$, which outputs the continuous-valued gains $\mathbf{k}$.
  • A critic network $Q(s, a \mid \theta^{Q})$, which estimates the expected return.
  • Target networks $\mu'$ and $Q'$ for stable learning.
  • A replay buffer for experience replay and sample efficiency.
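A minimal sketch of the actor and critic networks in Python/PyTorch is shown below; the layer sizes and the gain bound K_MAX are illustrative assumptions (the actual hyperparameters are listed in Table 2).

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 3                   # s = [e, de/dt, integral of e, Vo]; a = [k1, k2, k3]
K_MAX = torch.tensor([10.0, 10.0, 10.0])       # hypothetical upper bounds on the gains

class Actor(nn.Module):
    """mu(s | theta_mu): maps the measured state to continuous controller gains."""
    def __init__(self, hidden=64):             # hidden sizes are placeholders (see Table 2)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, ACTION_DIM), nn.Sigmoid())
    def forward(self, s):
        return self.net(s) * K_MAX              # squash, then scale into (0, K_MAX)

class Critic(nn.Module):
    """Q(s, a | theta_Q): estimates the expected return of a state-gain pair."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```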

4.3. Mathematical Formulation

The critic is trained to minimize the Bellman loss:
$L(\theta^{Q}) = \mathbb{E}_{s, a, r, s'} \left[ \left( Q(s, a \mid \theta^{Q}) - y \right)^{2} \right]$
with target:
$y = r + \gamma\, Q'\!\left( s', \pi'(s' \mid \theta^{\mu'}) \mid \theta^{Q'} \right)$
where $\theta^{Q'}$ and $\theta^{\mu'}$ denote the parameters of the target critic and actor networks, respectively.
The actor is trained using the deterministic policy gradient:
$\nabla_{\theta^{\mu}} J = \mathbb{E}_{s} \left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a = \pi(s)} \, \nabla_{\theta^{\mu}} \pi(s \mid \theta^{\mu}) \right]$
During training, exploration is ensured by injecting noise into the actor’s action:
$a_t = \pi(s_t \mid \theta^{\mu}) + \mathcal{N}_t$
where $\mathcal{N}_t$ is generated by an Ornstein–Uhlenbeck process to provide temporally correlated exploration.
The generated adaptive gains k ( t ) are injected into the ABSC controller to calculate the following:
$u(t) = f\!\left( e(t), \ \dot{e}(t), \ \int e(t)\, dt; \ \mathbf{k}(t) \right)$
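Under the same assumptions, the critic and actor updates of Equations (32)–(34) and the exploration noise of Equation (35) can be sketched as follows, reusing the Actor and Critic classes defined above; GAMMA, TAU, and LR are placeholder hyperparameters (the values used in this work are listed in Table 2).

```python
import copy
import torch
import torch.nn.functional as F

GAMMA, TAU, LR = 0.99, 0.005, 1e-3             # placeholder hyperparameters (see Table 2)

actor, critic = Actor(), Critic()              # classes from the previous sketch
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)   # target networks mu', Q'
opt_a = torch.optim.Adam(actor.parameters(), lr=LR)
opt_c = torch.optim.Adam(critic.parameters(), lr=LR)

def ddpg_update(s, a, r, s_next):
    """One mini-batch update following Eqs. (32)-(34)."""
    # Critic: minimize the Bellman loss (32) against the target (33).
    with torch.no_grad():
        y = r.view(-1, 1) + GAMMA * critic_t(s_next, actor_t(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    # Actor: ascend the deterministic policy gradient (34).
    actor_loss = -critic(s, actor(s)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    # Soft (Polyak) update of the target networks for stability.
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise N_t of Eq. (35)."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = torch.zeros(dim)
    def sample(self):
        self.x = self.x + self.theta * (-self.x) * self.dt \
                 + self.sigma * (self.dt ** 0.5) * torch.randn(self.x.shape)
        return self.x
```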

4.4. Learning Loop and Training Behavior

At each time step:
  • Observe system state s ( t ) .
  • Actor network outputs gains k ( t ) .
  • Controller computes u ( t ) using these gains.
  • System returns new state s ( t + 1 ) and reward r ( t ) .
  • Transition ( s , a , r , s ) is stored in the replay buffer.
  • Critic and actor networks are updated from sampled mini-batches using Equations (32)–(35).
  • Target networks are updated with a soft update rule to improve stability.
Training Behavior: In practice, the cumulative reward increases gradually as training progresses, with some fluctuations in the first episodes due to exploration noise. Initializing the controller gains with Grey Wolf Optimization (GWO) significantly reduces unstable exploration at the beginning of training and increases the speed of convergence. As training continues, the actor network stabilizes, and stable adaptive gains are achieved that optimize ABSC performance. This behavior indicates that the DDPG agent has learned a stable control policy.
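A condensed sketch of this loop is given below, building on the previous sketches; `env` stands for an assumed wrapper around the converter simulation (or the HIL interface) that applies the ABSC law with the supplied gains and returns the next state and the reward defined in Section 4.1.

```python
import random
from collections import deque

import numpy as np
import torch

buffer = deque(maxlen=100_000)        # replay buffer
noise = OUNoise(ACTION_DIM)
BATCH = 64

def train_episode(env, steps=2000):
    """One episode of the learning loop described above (env is an assumed interface)."""
    s = env.reset()
    for _ in range(steps):
        with torch.no_grad():
            k = actor(torch.as_tensor(s, dtype=torch.float32)).numpy()
        a = np.maximum(k + noise.sample().numpy(), 0.0)   # exploration; gains kept positive
        s_next, r = env.step(a)                            # ABSC applies the gains, returns reward
        buffer.append((s, a, r, s_next))
        s = s_next
        if len(buffer) >= BATCH:                           # sampled mini-batch update
            s_b, a_b, r_b, sn_b = (torch.as_tensor(np.array(x), dtype=torch.float32)
                                   for x in zip(*random.sample(buffer, BATCH)))
            ddpg_update(s_b, a_b, r_b, sn_b)
```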

4.5. Stability Consideration

To maintain stability, the Lyapunov condition is monitored:
$\dot{V}(t) = \frac{d}{dt}\left[ \frac{1}{2} X^{T}(t) P X(t) + \frac{1}{2\gamma} \tilde{s}^{T} \tilde{s} \right] \leq 0$
The reward function penalizes unstable responses, guiding learning toward safe regions in the control space. Figure 3 shows a general clarification of the RL algorithm and its working process.
In addition to the Lyapunov-based adaptive backstepping structure, the proposed controller relies on an RL agent implemented with deep neural networks. To ensure reproducibility and clarity, the neural network architecture, training procedure, and implementation details of the DDPG agent are described in the following section.

4.6. Neural Network Training and Architecture

The actor–critic networks of the DDPG agent were implemented as fully connected feedforward neural networks shown in Figure 4. The complete architecture is shown in Figure 4a, while the training and regression performance are summarized in Figure 4b. The main hyperparameters and settings used in training are listed in Table 2.
Though the DDPG policy allows real-time control gain adaptation, convergence performance and quality of learning strongly depend on the initial gain values. Improperly initialized gains may result in unstable exploration, reduced convergence rates, or even the inability to meet the Lyapunov-based stability condition early in learning. An initial policy should thus be well selected to speed up the reinforcement learning and lead it to an optimal solution.
To solve this problem, we incorporate the GWO algorithm as a metaheuristic pre-training technique to optimize the initial gain values k 1 , k 2 , and k 3 before feeding them to the DDPG agent. The GWO algorithm imitates the leadership hierarchy and hunting process of grey wolves and has proved to be very effective at global search in control optimization applications. In the next section, we describe in detail the GWO algorithm and how it was implemented to initialize the reinforcement learning block.

4.7. Grey Wolf Optimization for Initial Gain Tuning

To improve the initialization of the reinforcement learning agent, the Grey Wolf Optimization (GWO) algorithm is employed as a global search strategy to identify near-optimal initial values for the controller gains $k_1$, $k_2$, and $k_3$ before launching the DDPG learning process. GWO is a nature-inspired, population-based metaheuristic that mimics the social hierarchy and hunting behavior of grey wolves. The population is organized into a leadership hierarchy in which the top three solutions ($\alpha$, $\beta$, and $\delta$) guide the movement of the rest of the population. The search process includes three main phases: encircling prey, hunting, and attacking, all modeled mathematically to update the position of each search agent. Detailed benchmarking of this algorithm can be found in [22].
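A compact sketch of such a GWO search over the three gains is shown below. The population size and iteration count follow the settings reported later for the comparison (30 agents, 100 iterations), while the cost function and the search bounds are placeholders.

```python
import numpy as np

def gwo(cost, lb, ub, n_agents=30, n_iter=100, seed=0):
    """Minimal Grey Wolf Optimizer sketch for tuning (k1, k2, k3).

    cost: callable returning a scalar fitness (e.g., a simulation-based tracking cost);
    lb, ub: per-gain search bounds."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    X = rng.uniform(lb, ub, size=(n_agents, dim))        # wolf positions
    fit = np.apply_along_axis(cost, 1, X)
    for t in range(n_iter):
        order = np.argsort(fit)
        alpha, beta, delta = X[order[0]], X[order[1]], X[order[2]]
        a = 2.0 - 2.0 * t / n_iter                        # linearly decreasing: explore -> exploit
        for i in range(n_agents):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):           # encircling/hunting update
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2.0 * a * r1 - a, 2.0 * r2
                new_pos += (leader - A * np.abs(C * leader - X[i])) / 3.0
            X[i] = np.clip(new_pos, lb, ub)
            fit[i] = cost(X[i])
    # the best wolf provides the initial gains handed to the DDPG actor
    best = int(np.argmin(fit))
    return X[best], float(fit[best])
```

For example, `k0, f0 = gwo(tracking_cost, lb=[0, 0, 0], ub=[10, 10, 10])` would return the initial gain vector used to seed the actor network, where `tracking_cost` is a placeholder simulation-based cost of the kind described in the next subsection.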

Comparative Analysis of Optimization Strategies

In this study, GWO is applied to minimize a cost function composed of tracking error and control smoothness, which shares the same structure as the reward function used by the DDPG agent. Once the optimal gain vector is identified through GWO, it is used to initialize the actor network parameters of the RL controller, thus improving the training convergence and stability of the policy [12].
In order to ensure a fair comparison between the proposed GWO initializer and other metaheuristic algorithms, the same set of adaptive controller gains ( k 1 , k 2 , k 3 ) was optimized using identical search ranges and cost function. The algorithms differ only in their search strategies and specific hyperparameters, as summarized in Table 3.
All algorithms were applied to the same three adaptive gains ( k 1 , k 2 , k 3 ) under identical search ranges and cost function, with the population size (30), iteration count (100), and search dimension ( d = 3 , corresponding to the three adaptive gains) fixed equally across algorithms to guarantee a fair comparison. Algorithm-specific settings (e.g., inertia in PSO, crossover and mutation rates in GA) are reported for reproducibility.
To demonstrate the superiority of the GWO-based initialization approach, we conducted a comparative evaluation against other common metaheuristics, including PSO, GA, and ALO. The performance metrics used in the comparison, which can be computed as in the sketch following this list, include:
  • Convergence speed: Number of iterations to reach 95% of minimum cost.
  • Final tracking error: Mean square error after optimization.
  • Control effort: Sum of squared control input.
  • Stability margin: Evaluated through Lyapunov derivative analysis.
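A minimal sketch of how the first three indices could be computed from logged optimization and simulation traces is given below; the 95% criterion is interpreted here as the iteration at which 95% of the total cost reduction has been achieved, which is an assumption about the exact definition.

```python
import numpy as np

def convergence_speed(cost_history, frac=0.95):
    """Iterations needed to achieve `frac` of the total cost reduction."""
    c = np.asarray(cost_history, float)
    target = c[0] - frac * (c[0] - c.min())
    return int(np.argmax(c <= target))

def final_tracking_error(v_ref, v_out):
    """Mean square tracking error after optimization."""
    return float(np.mean((np.asarray(v_ref) - np.asarray(v_out)) ** 2))

def control_effort(u):
    """Sum of squared control input."""
    return float(np.sum(np.asarray(u) ** 2))
```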
The results in Table 4 clearly demonstrate the effectiveness of GWO in identifying better starting points for the reinforcement learning agent, which contributes to faster convergence and improved stability.
As shown, the GWO algorithm achieves the best overall performance across all metrics. It converges faster, maintains high robustness with minimal variation, exhibits low sensitivity to parameter changes, and produces the lowest cost function (fitness score). This justifies its selection as the optimal metaheuristic algorithm for fine-tuning the RL-ABSC method in this work. The overall schematic diagram of this work is depicted in detail in Figure 5.
Table 5 summarizes the computational complexity and practical real-time feasibility of the three controllers considered in this study. Both ABSC and RL–PID require only a fixed number of algebraic operations at each sampling instant, resulting in negligible processor load (<10%). The proposed DDPG+GWO–ABSC, while more demanding due to the actor network evaluation, remains computationally efficient with an execution time of approximately 30 µs at a 100 µs sampling period, corresponding to about 25% CPU utilization. The one-time grey wolf optimization is performed offline for initialization and does not impact the hardware loop. These results confirm that all controllers are real-time feasible on the TI LaunchXL-F28379D platform, with the proposed method offering enhanced adaptability at the cost of modestly higher computational requirements.

5. Result of Simulations

This section presents the results of the RL-ABSC across different operating scenarios based on the circuit specifications provided in Table 1. To begin, the controllers are evaluated under nominal conditions, without external disturbances, to assess their ability to maintain stable control. Figure 6 illustrates the tracking performance of the applied control strategies for the Boost converter with a resistive load, shown over a time domain in seconds.

5.1. Case 1: Output Regulation

In the first scenario, the controller’s performance was thoroughly assessed without any external disturbances in an operating range of 50 V supply voltage for the Boost and 100 V for the Buck converters, fed to a resistive load (R = 50 Ω).
Figure 6 shows the efficiency of the RL-ABSC controller in tracking reference values of 80–120 V in the Boost mode and 50–80 V in the Buck mode, whereas the conventional RL-PID and BSC show significant overshooting and signal distortion. The details in Figure 6 show that the BSC reaches the reference signal faster, but it suffers from fluctuation and sensitivity to error, making it an undesirable alternative in this case. The RL-PID controller exhibits quicker dynamics and reduced overshoot and undershoot, which are of critical importance in practical applications. Moreover, the RL-ABSC method shows significant performance in both convergence speed and sensitivity to error, ensuring its robustness in these structures.
In addition to the qualitative waveforms shown in Figure 6, the main quantitative performance indices of the three controllers (ABSC, RL–PID, and the proposed DDPG+GWO–ABSC) are summarized in Table 6. These indices include rise time, overshoot, settling time, RMSE, and IAE, providing a numerical comparison for both Buck and Boost modes. The results confirm that the proposed controller achieves the lowest overshoot, fastest rise and settling times, and reduced error.
To confirm the efficiency and robustness of the proposed controller, we experiment with a wide variety of scenarios. In the first case, we varied the output voltage (reference signal). To do this, we recreated three abrupt output voltage changes of the kind that are likely to occur in practice. Compensating for shifting trajectory references is a major challenge for controllers. Because of its strong responses and model-free architecture, the RL-ABSC approach can follow the reference signal with rapid dynamics, as shown in Figure 7 for both operational converters.
The gains of the BS controller, on the other hand, are inappropriate for these demanding circumstances; the controller experiences saturation and finds it difficult to accurately follow these voltage references. Additionally, because of its gain tuning procedure, the RL-PID exhibits acceptable behavior and quick tracking during these changes. The negative effects that variations in the tracking performance of traditional techniques might have on the internal components thus highlight the vital need for an online tuning mechanism in practical applications.

5.2. Case 2: Supply Voltage Variation

Another difficult condition is when the supply voltage fluctuates. The supply voltage is sometimes equated to a battery charged with renewable energy. We try to manage the output voltage while taking these uncertainties into account in order to meet this requirement. Therefore, Figure 8 tests the controllers’ performance under this scenario, and the converter’s three-step alterations are investigated. According to the results for the Boost converter, both BS-based controllers are insensitive to these changes; however, the RL-PID controller experiences more deviation and slower dynamics in overcoming these variations. BSC has, however, demonstrated saturation in tracking abrupt changes in the supply voltage when operating in Buck mode. The RL-ABSC technique accounts for the effects of these uncertainties with appropriate reactions.

5.3. Case 3: Load Uncertainty

The next scenario concerns the introduction of uncertainties into the structure of the system, which is a crucial issue for converter controllers. Unfavorable operating conditions may be the source of the system’s structural uncertainties. As shown in Figure 9, we abruptly changed the load on the converters to mimic the impact of parametric fluctuations on controller performance.
In Figure 9, sudden load variations are examined with a variation of 50 Ω to 10 Ω at 0.6 s and then from 10 Ω to 100 Ω at 1.2 s for both converters. In the Buck mode, BSC could not handle the first variation and faced saturation in its dynamics, while the RL-PID controller showed better adaptation with slower dynamics than the RL-ABSC method. However, looking at Figure 9, it is obvious that all of the controllers achieve proper regulation in these cases, but the BS-based controllers show higher adaptability to these variations.

5.4. Case 4: Noise Impact

It is an indisputable fact that converters work in real-time applications under various disturbances. In addition, noise is one of the most unpredictable circumstances in these scenarios, with a high impact on the converter’s components, which can lead to malfunction or error. To validate the performance of the controller under these challenging cases, we injected noise with variances in the range of 0.5–2. Figure 10 is dedicated to this case, showing different variances of noise applied to the converters. Despite the high impact of this influential disturbance, the proposed robust adaptive controller is able to compensate for this concern and keep the output signal at the desired level. However, the two other controllers do not perform suitably in this case, making them less practical for real-time applications.

6. Experimental Implementation

The Typhoon HIL platform was employed for experimental validation owing to its FPGA-based real-time solver, which ensures high fidelity in capturing the fast switching dynamics of power converters, as well as its seamless integration with DSP-based digital controllers. In contrast to general-purpose real-time simulators, Typhoon HIL offers specialized libraries for power electronic components, making it particularly suitable for validating adaptive control strategies in converter applications. Furthermore, its safe and versatile environment enables comprehensive testing across diverse operating conditions without posing any risk of hardware damage.
The experimental system was developed using a Typhoon HIL 404 real-time simulator (Somerville, MA, USA) (Figure 11) and a Texas Instruments LaunchXL-F28379D microcontroller (Dallas, TX, USA). The system models both DC–DC Buck and Boost converters with real-time control and monitoring capabilities. The experimental setup includes all primary components of the converter, such as inductors, capacitors, switches, and control circuitry, configured to represent real-world converter dynamics. The experiments were performed under different conditions to assess the controller’s robustness. More details for this technology are listed in [12].
This configuration speeds up development, improves safety, and offers a real-time evaluation of system responsiveness and stability. The SCADA setup and HIL simulation used in this investigation are shown in Figure 12. The online platform for the converters with the suggested voltage controller is displayed in Figure 12a, and the real-time SCADA platform for the online results is displayed in Figure 12b. This real-time testing configuration is used to test various scenarios in order to assess the effectiveness of the suggested controller for the voltage regulation of the converter.

6.1. Case: Tracking Performance

To evaluate the performance of this structure in real-time situations, the RL-ABSC controller tracks references of 70–100 V in Boost mode and 60–80 V in Buck mode without any disruptions (Figure 13).
The convergent behavior of the RL-ABSC controller with different reference values for both working converters is shown in Figure 13 and leads to optimal operation and minimum error. It should be mentioned that the real-time adaptation of the controller to both types of converters exhibits accurate tracking and quick convergence. Another possibility is to modify the reference voltage to accommodate different loads. Using numerous step adjustments, the impact of this sudden shift is assessed in Figure 14, showcasing the controller’s outstanding performance for both architectures. With a significant response and low error, it is clear from Figure 14 that the designed controller successfully handles rapid changes in the reference signal.

6.2. Case: Supply Voltage Variation

We have then shown that this approach is robust with respect to changes in the supply voltage level. The supply voltage of the converter is comparable to a battery that is charged by solar energy or other renewable energy sources. It is possible that the supply voltage level may fluctuate during the converter’s operation. Therefore, the tracking signal must be maintained at the same level as the reference signal in order to maintain the robustness of the controller developed in response to this disturbance. In order to investigate the impact of these phenomena on the systems, supply voltage fluctuations are evaluated in Figure 15. In Figure 15, the supply voltage experiences a range of sudden variations within 40–66 V for the Boost converter, while the tracking signal is set at 70 V, and within 70–120 V for the Buck converter, with a set value of 60 V. The controller exhibits robust performance against these sudden alterations, with low sensitivity and no deviation from the desired value. These results ensure the robustness of the controller and provide promising outcomes for real-time applications.

6.3. Case: Load Variation

There is a considerable chance of parametric deviations or changes in output load in industrial settings where converters are used, since ideal conditions are not available. A variety of load types is applied to the converter in order to verify the controller’s exceptional efficacy and robust performance in real-time applications. The load applied to the converter may not always be a constant resistive load. The converter’s reactions to abrupt changes in load are shown in Figure 16. In this instance, the load value fluctuates between 5 Ω and 1 kΩ, and the tracking value is set at 70 V in the Buck mode and 80 V in the Boost mode. Figure 16 illustrates how the RL-ABSC converges to adapt and robustly compensate for variations in the converters’ loads. It also demonstrates that lower resistances cause more pronounced dynamic changes in the systems and have a greater effect on the controller’s performance.

6.4. Case: Noise Impact

The effect of noise presents another difficult situation. These power applications are negatively impacted by a variety of disturbances, so it is essential to provide a strong framework to counteract the detrimental effects of noisy settings. Additionally, the system is evaluated under injected noise with variances of 1.5 and 3. In both operating modes, Figure 17 shows effective noise rejection with strong responses.

6.5. Case: Sudden Variation

Moreover, to test the performance of the controller against sudden alterations, an impulse reference signal is applied to both systems in Figure 18. In this scenario, an impulse is set as the reference signal in the range of 90 V to 100 V for the Boost converter and in the range of 90 V to 70 V for the Buck converter. This result makes it abundantly evident that the suggested controller follows these abrupt changes with remarkable robustness.
Figure 13, Figure 14, Figure 15, Figure 16, Figure 17 and Figure 18 demonstrate the advantages of using RL-ABSC for both converters, since this method has a robust and adaptive structure and performs well under different and challenging scenarios. To better highlight the superiority of this technique, the following section compares the effective elements and criteria for the various controllers.

7. Discussion

The HIL experiments demonstrated that the proposed RL-ABSC controller achieves reliable tracking and robust performance under a variety of operating conditions, including supply voltage fluctuations, abrupt load variations, noise injection, and sudden reference changes. Beyond these performance outcomes, it is crucial to discuss the stability of the DDPG agent when deployed in real-time hardware, as reinforcement learning approaches are often criticized for instability and unsafe exploration. Several mechanisms were integrated into the framework to ensure stable operation in the HIL environment. First, the reward function was shaped by incorporating a Lyapunov-based penalty term, which guided the learning process away from unstable trajectories. This ensured that the adaptation of controller gains remained bounded and consistent with the theoretical stability condition. Second, the GWO algorithm was used to pre-train the controller gains, providing a safe initialization that significantly reduced the likelihood of divergence during the early exploration phase. Third, standard stabilization techniques from the DDPG algorithm—such as target networks, replay buffer, and soft updates—were employed to avoid instability caused by correlated updates or rapid parameter changes. From an implementation perspective, the controller was executed on the TI LaunchXL-F28379D microcontroller with a fixed sampling rate and known computation delay. Duty cycle limits, anti-windup mechanisms, and output voltage/current safety envelopes were enforced to protect the converter during online training. These constraints served as an additional safeguard against instability when the agent explored control actions in real time. Importantly, discrete-time Lyapunov monitoring was performed during HIL runs, confirming that the Lyapunov function remained non-increasing across all tested scenarios. No divergence, oscillatory instability, or unsafe behavior was observed in the hardware experiments. Overall, the combination of Lyapunov-guided reward design, safe initialization via GWO, practical implementation safeguards, and HIL-based validation provides strong evidence that the DDPG-enhanced ABSC framework can maintain stable operation in real-world converter control. This addresses one of the main limitations of reinforcement learning deployment in power electronics and highlights the feasibility of integrating intelligent adaptive controllers into practical hardware platforms.
To validate Lyapunov stability in hardware, the candidate function defined in Equations (25)–(27) was implemented on the DSP using real-time measurements of voltage error and adaptive terms. At each sampling step, the function value V [ k ] was computed, and its discrete difference Δ V [ k ] was monitored. Stability was confirmed, as Δ V [ k ] remained non-positive and the function trajectory decayed towards a bounded value across all tested scenarios. This approach ensures that the theoretical Lyapunov guarantees are directly observable in the HIL environment.
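A minimal sketch of such a discrete-time monitor is shown below, assuming access to the logged state vector X[k], the parameter-error vector s̃[k], and the matrices P and γ of Equation (25); the tolerance eps is a placeholder.

```python
import numpy as np

def lyapunov_value(X, P, s_tilde, gamma):
    """Discrete Lyapunov candidate of Eq. (25): V = 0.5*X'PX + (1/(2*gamma))*s's."""
    return 0.5 * X @ P @ X + 0.5 / gamma * s_tilde @ s_tilde

def monitor_step(V_history, X, P, s_tilde, gamma, eps=1e-6):
    """Append V[k] and flag a violation when dV[k] = V[k] - V[k-1] exceeds the tolerance."""
    Vk = lyapunov_value(X, P, s_tilde, gamma)
    violated = bool(V_history) and (Vk - V_history[-1]) > eps
    V_history.append(Vk)
    return violated
```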
Although Table 4 provides a quantitative comparison of GWO with PSO, GA, and ALO in simulation, reproducing the same comparison in HIL was not feasible. Running multiple optimization algorithms in hardware would require separate converter setups or repeated firmware reconfiguration, which introduces safety risks and significant resource constraints. For this reason, we limited the HIL validation to the GWO-initialized controller, which was already shown in simulation to offer superior performance. Future work may investigate multi-algorithm HIL benchmarking frameworks to further expand this comparison.

8. Conclusions

This paper has developed a reliable adaptive BSC method for Boost and Buck converters. Conventional approaches lack the necessary efficiency to safeguard power converters from harmful disruptions, necessitating more sophisticated and robust controllers to ensure reliable functioning in a variety of real-world scenarios. Therefore, an adaptive backstepping method has been designed based on Lyapunov stability theory to both ensure the robustness of the converters in different working conditions and enhance their performance under disturbances. However, there are a number of scenarios and parametric unknowns that could cause disruptions to this approach. To solve this problem, we used the DDPG algorithm in conjunction with a hardware configuration and a Reinforcement Learning (RL) scheme to regulate the converters’ output voltage. In particular, DDPG combines the advantages of the deterministic policy gradient method, deep Q-networks, and actor–critic networks. Additionally, a new and well-behaved metaheuristic method known as the GWO algorithm was used to tune the controller’s initial gains. We also compared the RL-ABSC controller with traditional ABSC and RL-PID controllers to demonstrate the superiority of this work. In the final analysis, the RL-ABSC technique outperformed conventional techniques thanks to its robust dynamics in all working modes. To improve the dynamic performance of the RL technique, the GWO algorithm was adopted to set the most suitable initial parameters for the RL block. This controller demonstrated faster dynamics and significantly better convergence. In addition, the results obtained on a real-time platform using an HIL device confirm the significant robustness of the proposed controller and its well-behaved dynamics in disturbance-rejection modes. This work can be further improved using identification processes to better adapt the ABSC method to undesired working environments. Future work will also explore the extension of the proposed RL-ABSC framework to multiphase and multi-converter systems, where the effective system order must first be assigned according to the topology before controller design.

Author Contributions

S.M.G.: conceptualization, methodology, software, writing—original draft, investigation, validation, implementation. A.A.: principal supervision, writing—review and editing, methodology, project management. M.G.: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Erickson, R.W. DC–DC power converters. In Wiley Encyclopedia of Electrical and Electronics Engineering; John Wiley & Sons: Hoboken, NJ, USA, 2001. [Google Scholar]
  2. Xu, L.; Guerrero, J.M.; Lashab, A.; Wei, B.; Bazmohammadi, N.; Vasquez, J.C.; Abusorrah, A. A review of DC shipboard microgrids—Part I: Power architectures, energy storage, and power converters. IEEE Trans. Power Electron. 2021, 37, 5155–5172. [Google Scholar] [CrossRef]
  3. Hu, J.; Shan, Y.; Cheng, K.W.; Islam, S. Overview of power converter control in microgrids—Challenges, advances, and future trends. IEEE Trans. Power Electron. 2022, 37, 9907–9922. [Google Scholar] [CrossRef]
  4. Bushra, E.; Zeb, K.; Ahmad, I.; Khalid, M. A comprehensive review on recent trends and future prospects of PWM techniques for harmonic suppression in renewable energies based power converters. Results Eng. 2024, 22, 102213. [Google Scholar] [CrossRef]
  5. Gupta, M.; Gupta, N.; Garg, M.M.; Kumar, A. Robust control strategies applicable to DC–DC converter with reliability assessment: A review. Adv. Control Appl. Eng. Ind. Syst. 2024, 6, E217. [Google Scholar] [CrossRef]
  6. Shen, X.; Liu, J.; Liu, Z.; Ga, Y.; Leon, J.I.; Vazquez, S.; Wu, L.; Franquel, L.G. Sliding mode control of neutral-point-clamped power converters with gain adaptation. IEEE Trans. Power Electron. 2024, 39, 9189–9201. [Google Scholar] [CrossRef]
  7. Shen, X.; Liu, G.; Liu, J.; Gao, Y.; Leon, J.I.; Wu, L.; Franquelo, L.G. Fixed-time sliding mode control for NPC converters with improved disturbance rejection performance. IEEE Trans. Ind. Inform. 2025, 21, 4476–4487. [Google Scholar] [CrossRef]
  8. Qureshi, M.A.; Musumeci, S.; Torelli, F.; Reatti, A.; Mazza, A.; Chicco, G. A novel model reference adaptive control approach investigation for power electronic converter applications. Int. J. Electr. Power Energy Syst. 2024, 156, 109722. [Google Scholar] [CrossRef]
  9. Liu, X.; Qiu, L.; Fang, Y.; Wang, K.; Li, Y.; Rodríguez, J. A simple model-free solution for finite control-set predictive control in power converters. IEEE Trans. Power Electron. 2024, 39, 12627–12635. [Google Scholar] [CrossRef]
  10. Vazani, A.; Mirshekali, H.; Mijatovic, N.; Ghaffari, V.; Dashti, R.; Shaker, H.R.; Mardani, M.M.; Dragičević, T. Composite nonlinear feedback control of a DC-DC boost converter under input voltage and load variation. Int. J. Electr. Power Energy Syst. 2024, 155, 109562. [Google Scholar] [CrossRef]
  11. He, W.; Zhang, Y.; Zhou, W. Observerless output feedback control of DC-DC converters feeding a class of unknown nonlinear loads via power shaping. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 2951–2963. [Google Scholar] [CrossRef]
  12. Ghamari, S.M.; Molaee, H.; Ghahramani, M.; Habibi, D.; Aziz, A. Design of an Improved Robust Fractional-Order PID Controller for Buck–Boost Converter using Snake Optimization Algorithm. IET Control Theory Appl. 2025, 19, E70008. [Google Scholar] [CrossRef]
  13. Nithara, P.V.; Anand, R.; Ramprabhakar, J.; Meena, V.P.; Padmanaban, S.; Khan, B. Brayton–Moser passivity based controller for constant power load with interleaved boost converter. Sci. Rep. 2024, 14, 28325. [Google Scholar] [CrossRef]
  14. Reddy, A.; Bhukya, C.N.; Venkatesh, A. Constant power load in DC microgrid system: A passivity based control of two input integrated DC-DC converter. e-Prime-Adv. Electr. Eng. Electron. Energy 2025, 11, 100941. [Google Scholar]
  15. Al-Dabbagh, Z.A.; Shneen, S.W.; Hanfesh, A.O. Fuzzy Logic-based PI Controller with PWM for Buck-Boost Converter. J. Fuzzy Syst. Control 2024, 2, 147–159. [Google Scholar] [CrossRef]
  16. Wiryajati, I.K.; Satiawan, I.N.W.; Suksmadana, I.M.B.; Wiwaha, B.B.P. Investigation and Analysis of Fuzzy Logic Controller Method on DC-DC Buck-Boost Converter. J. Penelit. Pendidik. IPA 2025, 11, 1066–1074. [Google Scholar] [CrossRef]
  17. Manoharan, R.; Wahab, R.S. Model predictive controller-based Convolutional Neural Network controller for optimal frequency tracking of resonant converter-based EV charger. Results Eng. 2024, 24, 103658. [Google Scholar] [CrossRef]
  18. Ramu, S.K.; Vairavasundaram, I.; Palaniyappan, B.; Bragadeshwaran, A.; Aljafari, B. Enhanced energy management of DC microgrid: Artificial neural networks-driven hybrid energy storage system with integration of bidirectional DC-DC converter. J. Energy Storage 2024, 88, 111562. [Google Scholar] [CrossRef]
  19. Al-Dabbagh, Z.A.; Shneen, S.W. Neuro-Fuzzy Controller for a Non-Linear Power Electronic DC-DC Boost Converters. J. Robot. Control. (JRC) 2024, 5, 1479–1491. [Google Scholar]
  20. Sahraoui, H.; Mellah, H.; Mouassa, S.; Jurado, F.; Bessaad, T. Lyapunov-Based Adaptive Sliding Mode Control of DC–DC Boost Converters Under Parametric Uncertainties. Machines 2025, 13, 734. [Google Scholar] [CrossRef]
  21. Liu, X. Design of CCM boost converter utilizing fractional-order PID and Lyapunov-based PID techniques for PF correction. Electr. Eng. 2025, 107, 3451–3462. [Google Scholar] [CrossRef]
  22. Ghamari, S.; Habibi, D.; Ghahramani, M.; Aziz, A. Design of a Robust Adaptive Cascade Fractional-Order Nonlinear-Based Controller Enhanced Using Grey Wolf Optimization for High-Power DC/DC Dual Active Bridge Converter in Electric Vehicles. IET Power Electron. 2025, 18, E70056. [Google Scholar] [CrossRef]
  23. Cheng, H.; Jung, S.; Kim, Y. A novel reinforcement learning controller for the DC-DC boost converter. Energy 2025, 321, 135479. [Google Scholar] [CrossRef]
  24. Chen, P.; Zhao, J.; Liu, K.; Zhou, J.; Dong, K.; Li, Y.; Guo, X.; Pan, X. A review on the applications of reinforcement learning control for power electronic converters. IEEE Trans. Ind. Appl. 2024, 60, 8430–8450. [Google Scholar] [CrossRef]
  25. Vu, N.T.-T.; Nguyen, H.X.; Bui, M.Q. Adaptive optimal sliding mode control for three-phase voltage source inverter: Reinforcement learning approach. Trans. Inst. Meas. Control 2024, 46, 2001–2012. [Google Scholar]
  26. Wan, Y.; Xu, Q.; Dragičević, T. Reinforcement learning-based predictive control for power electronic converters. IEEE Trans. Ind. Electron. 2024, 72, 5353–5364. [Google Scholar] [CrossRef]
  27. Abdulkader, R.; Salem, M.; Senjyu, T. Adaptive Voltage Control of Single-Inductor 3x Multilevel Converters Interfaced DC Microgrids Using Multi-Agent Approximate Q-Learning. IEEE Access 2024, 12, 114295–114303. [Google Scholar] [CrossRef]
  28. Oboreh-Snapps, O.; Sharma, A.; Saelens, J.; Fernandes, A.; Strathman, S.A.; Morris, L.; Uddarraju, P.; Kimball, J.W. Feedback Control of CLLLC Resonant DC-DC Converter using Deep Reinforcement Learning. In Proceedings of the 2024 IEEE Energy Conversion Congress and Exposition (ECCE), Phoenix, AZ, USA, 20–24 October 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
  29. Wan, Y.; Xu, Q.; Dragičević, T. Safety-enhanced self-learning for optimal power converter control. IEEE Trans. Ind. Electron. 2024, 71, 15229–15234. [Google Scholar] [CrossRef]
  30. Çimen, M. Controller Design for DC-DC Boost Converter Using PI, State Feedback and Q Learning. Gaziosmanpaşa Bilimsel Araştırma Derg. 2024, 13, 30–46. [Google Scholar]
  31. Rajamallaiah, A.; Karri, S.P.K.; Shankar, Y.R. Deep reinforcement learning based control strategy for voltage regulation of DC-DC Buck converter feeding CPLs in DC microgrid. IEEE Access 2024, 12, 17419–17430. [Google Scholar] [CrossRef]
  32. Zandi, O.; Poshtan, J. Voltage control of DC–DC converters through direct control of power switches using reinforcement learning. Eng. Appl. Artif. Intell. 2023, 120, 105833. [Google Scholar] [CrossRef]
  33. Ghahramani, M.; Habibi, D.; Ghamari, S.; Aziz, A. Optimal Operation of an Islanded Hybrid Energy System Integrating Power and Gas Systems. IEEE Access 2024, 12, 196591–196608. [Google Scholar] [CrossRef]
  34. Saha, U.; Jawad, A.; Shahria, S.; Rashid, A.B.M.H.-U. Proximal policy optimization-based reinforcement learning approach for DC-DC boost converter control: A comparative evaluation against traditional control techniques. Heliyon 2024, 10, e37823. [Google Scholar] [CrossRef] [PubMed]
  35. Muktiadji, R.F.; Ramli, M.A.M.; Milyani, A.H. Twin-Delayed Deep Deterministic Policy Gradient Algorithm to Control a Boost Converter in a DC Microgrid. Electronics 2024, 13, 433. [Google Scholar] [CrossRef]
  36. Ye, J.; Zhao, D.; Pan, X.; Li, S.; Wang, B.; Zhang, X.; Iu, H.H.C. Improving Voltage Regulation of Interleaved DC-DC Boost Converter via Soft Actor-Critic Algorithm Based Reinforcement Learning Controller. IEEE J. Emerg. Sel. Top. Power Electron. 2025. [CrossRef]
  37. Ghamari, S.M.; Habibi, D.; Aziz, A. Robust Adaptive Fractional-Order PID Controller Design for High-Power DC-DC Dual Active Bridge Converter Enhanced Using Multi-Agent Deep Deterministic Policy Gradient Algorithm for Electric Vehicles. Energies 2025, 18, 3046. [Google Scholar] [CrossRef]
  38. Rajwar, K.; Deep, K.; Das, S. An exhaustive review of the metaheuristic algorithms for search and optimization: Taxonomy, applications, and open challenges. Artif. Intell. Rev. 2023, 56, 13187–13257. [Google Scholar] [CrossRef]
  39. Tomar, V.; Bansal, M.; Singh, P. Metaheuristic algorithms for optimization: A brief review. Eng. Proc. 2024, 59, 238. [Google Scholar]
  40. Nassef, A.M.; Abdelkareem, M.A.; Maghrabie, H.M.; Baroutaji, A. Review of metaheuristic optimization algorithms for power systems problems. Sustainability 2023, 15, 9434. [Google Scholar] [CrossRef]
  41. Yakut, Y.B. A new control algorithm for increasing efficiency of PEM fuel cells–Based boost converter using PI controller with PSO method. Int. J. Hydrogen Energy 2024, 75, 1–11. [Google Scholar] [CrossRef]
  42. Hollweg, G.V.; Evald, P.J.D.d.O.; Mattos, E.; Borin, L.C.; Tambara, R.V.; Montagner, V.F. Self-tuning methodology for adaptive controllers based on genetic algorithms applied for grid-tied power converters. Control Eng. Pract. 2023, 135, 105500. [Google Scholar] [CrossRef]
  43. Peng, C.; Ghamari, S.M.; Mollaee, H.; Rezaei, O. Design of a novel robust adaptive fractional-order model predictive controller for boost converter using grey wolf optimization algorithm. Sci. Rep. 2025, 15, 27670. [Google Scholar] [CrossRef]
  44. Khan, M.A.; Yousaf, M.Z.; Khalid, S.; Fashihi, D.; Bokhari, S.A.H.; Insafmal, B.K.; Abbas, G. Applying Ant Lion Optimization Technique to Enhance Power Converters Performance via Effective Controller Tuning. In Proceedings of the 2023 2nd International Conference on Emerging Trends in Electrical, Control, and Telecommunication Engineering (ETECTE), Lahore, Pakistan, 27–29 November 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  45. Liu, Y.; As'arry, A.; Hassan, M.K.; Hairuddin, A.A.; Mohamad, H. Review of the grey wolf optimization algorithm: Variants and applications. Neural Comput. Appl. 2024, 36, 2713–2735. [Google Scholar] [CrossRef]
  46. Krishnaram, K.; T. Padmanabhan, S.; Alsaif, F.; Senthilkumar, S. Development of grey wolf optimization based modified fast terminal sliding mode controller for three phase interleaved boost converter fed PV system. Sci. Rep. 2024, 14, 9256. [Google Scholar] [CrossRef] [PubMed]
  47. Lai, S.; Wang, W. Design of CCM boost converter using fractional-order PID controller optimized with gray wolf algorithm for power factor correction. Int. J. Dyn. Control 2024, 12, 3685–3693. [Google Scholar] [CrossRef]
  48. Jagatheesan, K.; Boopathi, D.; Samanta, S.; Anand, B.; Dey, N. Grey wolf optimization algorithm-based PID controller for frequency stabilization of interconnected power generating system. Soft Comput. 2024, 28, 5057–5070. [Google Scholar] [CrossRef]
Figure 1. Boost topology with its general structure and components: (a) Boost topology, (b) ON state, (c) OFF state [1].
Figure 2. Buck topology with its general structure and components: (a) Buck topology, (b) ON state, (c) OFF state [1].
Figure 3. Overview of the DDPG-based actor–critic framework.
Figure 4. Neural network training and architecture used in the DDPG agent: (a) feedforward network structure with two hidden layers, (b) regression and loss performance showing convergence and generalization.
Figure 5. Detailed controller structure.
Figure 6. Tracking outcomes of the controllers in Buck and Boost modes. (a) 80 V in Buck mode, (b) 50 V in Buck mode, (c) 100 V in Boost mode, (d) 80 V in Boost mode, (e) 120 V in Boost mode.
Figure 7. Controller performance under sudden reference signal variations: (a) Buck mode, (b) Boost mode.
Figure 8. Impact of supply voltage variation on the performance of the controllers: (a) tracking in Buck mode, (b) tracking in Boost mode.
Figure 9. Performance of the controllers under load variations: (a) Performance in Buck mode, (b) Performance in Boost mode.
Figure 10. Tracking regulation of the controller with noise: (a) Buck mode, (b) Boost mode.
Figure 11. Real-time experimental setup implemented using the Typhoon HIL 404 simulator and TI LaunchXL-F28379D controller [12].
Figure 12. Real-time testing setup: (a) the test model with the proposed controller running in the HIL emulation environment, and (b) the corresponding HIL SCADA platform.
Figure 13. Tracking outcomes of the controllers in Buck and Boost modes: (a) tracking 100 V in Boost mode, (b) tracking 70 V in Boost mode, (c) tracking 90 V in Buck mode, (d) tracking 60 V in Buck mode.
Figure 14. Controller performance under abrupt reference signal changes, including: (a) Buck mode, (b) Boost mode.
Figure 15. Impact of supply voltage variation on the performance of the controllers: (a) tracking in Boost mode, (b) tracking in Buck mode.
Figure 16. Performance of the controllers under load variations: (a) tracking in Buck mode, (b) tracking in Boost mode.
Figure 17. Performance of the controllers under noise injection: (a) tracking in Buck mode, (b) tracking in Boost mode.
Figure 18. Performance of the controllers under step pulse changes: (a) tracking in Buck mode, (b) tracking in Boost mode.
Table 1. Values and their definitions in Buck and Boost modes.
Component | Definition | Boost | Buck
Vout | Output Voltage | 60–120 V | 50–120 V
E | Supply Voltage | 10–50 V | 120–60 V
L | Inductor | 2 μH | 2 μH
R | Resistor (Load) | 50 Ω | 50 Ω
C | Capacitor | 100 μF | 100 μF
f | Switching Frequency | 20 kHz | 20 kHz
Table 2. Neural network architecture and training parameters used in the DDPG agent.
Component | Description
Actor network architecture | Input layer: 3 neurons [e, ė, ∫e]; hidden layer 1: 10 neurons (ReLU); hidden layer 2: 6 neurons (ReLU); output layer: 3 neurons (k1, k2, k3) with tanh activation scaled to safe gain ranges (Figure 4a).
Critic network architecture | Inputs: state vector + action vector; hidden layers: 2 layers of 64 neurons each (ReLU activations); output: single Q-value (linear).
Optimizers | Adam; learning rate 10⁻⁴ for the actor, 10⁻³ for the critic.
Replay buffer | Capacity: 100,000 transitions; mini-batch size: 64.
Target networks | Soft update with factor τ = 0.005.
Discount factor | γ = 0.99.
Exploration strategy | Ornstein–Uhlenbeck noise (θ = 0.15, σ decays from 0.2 to 0.05).
Training schedule | 300 episodes, 6000 steps per episode; warm-up of 5000 random steps before training begins.
Regularization | Gradient clipping (max norm = 1.0) and normalization of state observations.
Initialization | Gains initialized offline using Grey Wolf Optimization (GWO), ensuring stable exploration from the start.
Training outcome | Cumulative reward increased steadily; fluctuations in early episodes due to noise; convergence achieved as the actor policy stabilized. Training/test losses converge towards zero (Figure 4b).
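For readers wishing to reproduce the agent, the following is a minimal PyTorch sketch of the actor and critic networks summarized in Table 2 above. Layer sizes and learning rates follow the table; the class names, the gain-range bounds, and the output scaling are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the 3-dimensional error state to the three backstepping gains (k1, k2, k3)."""
    def __init__(self, gain_max):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 10), nn.ReLU(),   # hidden layer 1: 10 neurons
            nn.Linear(10, 6), nn.ReLU(),   # hidden layer 2: 6 neurons
            nn.Linear(6, 3), nn.Tanh(),    # raw output in [-1, 1]
        )
        # upper bounds of the "safe gain range" (hypothetical values, not from the paper)
        self.register_buffer("gain_max", torch.tensor(gain_max, dtype=torch.float32))

    def forward(self, state):
        # map tanh output from [-1, 1] to (0, gain_max) so the gains stay positive
        return 0.5 * (self.net(state) + 1.0) * self.gain_max

class Critic(nn.Module):
    """Estimates Q(s, a) from the concatenated state and action vectors."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 3, 64), nn.ReLU(),  # state (3) + action (3)
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),                 # single Q-value, linear output
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# learning rates as listed in Table 2; gain bounds are illustrative
actor, critic = Actor(gain_max=[15.0, 8.0, 2.0]), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```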
Table 3. Parameter configuration and selected gains of metaheuristic algorithms for initial controller gain tuning.
Algorithm | Population (P) | Iterations (T) | Dimension (d) | Algorithm-Specific Settings | Selected Gains (k1, k2, k3)
GWO | 30 wolves | 100 | 3 | Coefficient vectors α, β | (10.2, 5.1, 0.8)
ALO | 30 ants | 100 | 3 | Roulette wheel selection | (11.4, 5.6, 0.9)
PSO | 30 particles | 100 | 3 | Inertia w = 0.7, c1 = c2 = 1.5 | (9.8, 4.9, 0.7)
GA | 30 chromosomes | 100 | 3 | Crossover = 0.8, mutation = 0.1 | (12.0, 6.2, 1.0)
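A compact sketch of a standard Grey Wolf Optimization loop, configured with the settings listed in Table 3 (30 wolves, 100 iterations, 3 decision variables), is given below. The fitness function and the search bounds are placeholders assumed for illustration; the paper's actual fitness would evaluate the closed-loop converter response.

```python
import numpy as np

def gwo(fitness, lb, ub, n_wolves=30, n_iter=100, seed=0):
    """Grey Wolf Optimization over a box-constrained search space (minimization)."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    wolves = lb + rng.random((n_wolves, dim)) * (ub - lb)    # random initial pack
    scores = np.array([fitness(w) for w in wolves])

    for t in range(n_iter):
        alpha, beta, delta = wolves[np.argsort(scores)[:3]]  # three best wolves so far
        a = 2.0 * (1.0 - t / n_iter)                         # linearly decreasing coefficient

        for i in range(n_wolves):
            candidate = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2.0 * a * r1 - a, 2.0 * r2
                D = np.abs(C * leader - wolves[i])
                candidate += (leader - A * D) / 3.0          # average of the three pulls
            wolves[i] = np.clip(candidate, lb, ub)
            scores[i] = fitness(wolves[i])

    return wolves[np.argmin(scores)]

# Illustrative call: simulate_iae is a hypothetical placeholder that would run a
# closed-loop converter simulation and return its IAE; bounds are assumed values.
# k1, k2, k3 = gwo(simulate_iae, lb=[0.0, 0.0, 0.0], ub=[20.0, 10.0, 2.0])
```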
Table 4. Comparison of optimization methods for controller gain initialization.
Algorithm | Convergence Iteration | Final IAE | Std. Deviation | Param. Sensitivity | CPU Time (s) | Final Fitness
GWO | 52 | 1.62 | 0.007 | 1.5% | 17.4 | 0.125
ALO | 57 | 1.73 | 0.009 | 1.8% | 18.6 | 0.138
PSO | 74 | 2.38 | 0.026 | 3.7% | 19.8 | 0.196
GA | 85 | 2.61 | 0.033 | 4.2% | 31.4 | 0.209
Table 5. Computational complexity and real-time feasibility of controllers.
Controller | Online Effort | Memory | Sampling Time (µs) | Exec. Time (µs) | Delay (µs)
ABSC | Few algebraic ops. (fixed) | Low | 100 | 10 | 3
RL–PID | Error calc., sum/diff, clamp | Low | 100 | 5 | 2
RL–ABSC | Actor NN + ABSC algebra | Moderate | 100 | 25 | 5
Table 6. Quantitative performance metrics of controllers in Buck and Boost modes (corresponding to Figure 6).
Controller | Case | Rise Time (ms) | Overshoot (%) | Settling Time (ms) | RMSE (V) | IAE
ABSC | Buck | 6.5 | 3.8 | 12.0 | 0.42 | 5.1
ABSC | Boost | 7.2 | 4.1 | 13.5 | 0.47 | 5.6
RL–PID | Buck | 4.2 | 6.7 | 10.5 | 0.36 | 4.3
RL–PID | Boost | 4.8 | 7.2 | 11.0 | 0.41 | 4.9
Proposed RL–ABSC | Buck | 3.1 | 2.1 | 7.8 | 0.25 | 3.2
Proposed RL–ABSC | Boost | 3.6 | 2.5 | 8.4 | 0.29 | 3.5
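For reference, the RMSE and IAE values reported in Table 6 can be computed from a sampled output-voltage trace as in the short sketch below; this assumes a uniformly sampled trace and a rectangle-rule approximation of the error integral, not the exact post-processing used for the paper.

```python
import numpy as np

def tracking_metrics(v_out, v_ref, dt):
    """RMSE (V) and IAE of the output-voltage tracking error for a uniformly sampled trace."""
    e = np.asarray(v_ref, float) - np.asarray(v_out, float)
    rmse = float(np.sqrt(np.mean(e ** 2)))   # root-mean-square error
    iae = float(np.sum(np.abs(e)) * dt)      # integral of absolute error (rectangle rule)
    return rmse, iae
```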
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
