TD3-Based Reinforcement Learning for Adaptive PID-like Control of Uncertain Dynamical Systems

Demircioğlu, Ufuk; Bakır, Halit; Almarri, Badar; Abdul Hafez, A. H.

doi:10.3390/math14101744


¹ Faculty of Engineering and Natural Sciences, Sivas University of Science and Technology, Sivas 58000, Turkey

² Department of Computer Science, College of Computer Sciences and Information Technology, King Faisal University, Al-Ahsa 36362, Saudi Arabia

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1744; https://doi.org/10.3390/math14101744

Submission received: 10 April 2026 / Revised: 14 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026

(This article belongs to the Topic Modeling, Stability, and Control of Dynamic Systems and Their Applications)


Abstract

This paper presents a TD3-based reinforcement learning framework for adaptive PID-like control of uncertain dynamical systems. Although proportional–integral–derivative (PID) control remains widely used because of its simplicity, interpretability, and practical effectiveness, fixed-gain PID controllers often experience performance degradation in the presence of external disturbances, parameter variations, and changing operating conditions. To address this limitation, the control task is formulated as a continuous-action reinforcement learning problem in which the observation vector is constructed from PID-related error components, namely the tracking error, its integral, and its derivative. Based on these error-derived observations, a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent learns a bounded continuous control policy through interaction with the environment while preserving a PID-like structural interpretation. The proposed framework is evaluated on a representative mass–spring–damper system under three challenging scenarios: external disturbance, parametric uncertainty, and their simultaneous presence. Its performance is further examined for both constant-reference regulation and sinusoidal reference tracking. The simulation results show that the learned controller achieves stable and accurate tracking, fast transient response, and robust behavior across varying operating conditions. These findings demonstrate the potential of TD3-based reinforcement learning as an effective adaptive PID-like control strategy for uncertain dynamical systems.

Keywords: adaptive PID control; reinforcement learning; policy learning; TD3 algorithm; robust tracking; parametric uncertainty; continuous control

MSC: 68T05

1. Introduction

Control systems in real-world applications are required to operate reliably under uncertainty, external disturbances, and varying operating conditions. In industrial automation, robotics, and mechatronic systems, controllers must achieve fast transient response, low steady-state error, and robust closed-loop performance despite modeling inaccuracies and unmeasured perturbations. These challenges motivate the development of adaptive control strategies capable of maintaining performance when nominal assumptions are violated.

The proportional–integral–derivative (PID) controller continues to be one of the most widely used control structures in practice because of its simplicity, interpretability, low implementation cost, and effectiveness across a broad range of engineering processes [1,2]. Its widespread adoption in industry is also supported by the fact that PID control offers an intuitive relationship between the tracking error and the generated control action through proportional, integral, and derivative terms. However, the practical success of PID control depends strongly on the appropriate selection of the gains $K_p$, $K_i$, and $K_d$. Classical tuning rules such as Ziegler–Nichols and Cohen–Coon provide convenient initial settings for nominal operating conditions, but their performance often deteriorates in the presence of plant uncertainty, time-varying dynamics, actuator limitations, and persistent disturbances [1,2,3,4]. Under such conditions, fixed-gain PID controllers may suffer from oscillation, degraded tracking accuracy, longer settling time, and reduced robustness.

To address these limitations, a large body of work has investigated improved PID tuning strategies. Model-based approaches, including frequency-domain design and internal model control, provide more systematic tuning procedures when an accurate plant model is available [4]. In addition, heuristic and metaheuristic optimization techniques such as genetic algorithms, particle swarm optimization, simulated annealing, ant colony optimization, and differential evolution have been widely used to search for high-quality PID parameters in nonlinear, constrained, or multi-objective settings [5,6,7,8,9,10,11,12,13]. These approaches have shown that substantial performance gains can be achieved over hand-tuned or rule-based settings. Nevertheless, in most cases the obtained controller parameters are determined offline. As a result, once the operating condition changes or the plant deviates from its nominal model, the previously optimized gains may no longer remain satisfactory.

Recent advances in reinforcement learning (RL), particularly deep reinforcement learning (DRL), have opened new directions for adaptive control and controller auto-tuning. Unlike purely offline optimization methods, RL learns through repeated interaction with the environment and can improve decision policies directly from observed closed-loop behavior. For continuous-control tasks, actor–critic algorithms such as Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor–Critic (SAC) have demonstrated strong capability in learning smooth real-valued control policies [14,15,16]. Among these methods, TD3 is particularly suitable for control applications due to its improved training stability, achieved through clipped double critics, delayed policy updates, and target policy smoothing. These mechanisms reduce overestimation bias and enhance convergence reliability in deterministic continuous-control settings [15]. These characteristics make TD3 a strong candidate for adaptive control problems in which the policy must operate in a continuous action space while maintaining stability under uncertainty.

The use of RL for PID-related control design has therefore attracted increasing attention. Existing studies have investigated autonomous PID tuning, actor–critic-based self-tuning PID formulations, and model-based RL strategies for controller adaptation [17,18,19]. More recent studies have further explored advanced deep RL algorithms for adaptive PID control in nonlinear environments and industrial settings [20,21,22]. Additional work on incremental RL-based PID adaptation and adaptive deep reinforcement learning for PID-controlled robotic systems further confirms the growing relevance of learning-based PID design in modern control applications [23,24,25,26]. Collectively, these studies demonstrate that RL can serve as a viable mechanism for improving controller adaptability beyond fixed-gain tuning.

Despite this progress, several limitations remain in the current literature. First, many existing RL-based PID studies evaluate controller performance under relatively simplified conditions, such as nominal plant parameters, a single disturbance source, or a single operating mode. Second, the simultaneous presence of internal parameter uncertainty and external disturbance is less frequently addressed, even though this combination is common in practical systems and can significantly degrade closed-loop performance. Third, many learning-based control formulations emphasize policy performance but provide limited structural interpretability in relation to classical PID behavior. Finally, a number of studies focus either on set-point regulation or on trajectory tracking, whereas comparatively fewer works examine both constant-reference and time-varying reference tracking within a unified learning-based framework.

To address these limitations, this study proposes a reinforcement-learning-based adaptive PID control framework that explicitly considers both uncertainty and disturbance while preserving the interpretability of classical control structures. From the control perspective, the objective is to generate a control input $u(t)$ such that the plant output $y(t)$ follows a desired reference signal $r(t)$ while minimizing the tracking error

$$e(t) = r(t) - y(t). \tag{1}$$

In classical PID control, the control law is expressed through the proportional, integral, and derivative gains $K_p$, $K_i$, and $K_d$, respectively. In adaptive PID control, these quantities are no longer treated as fixed design constants, but are instead adjusted according to the current closed-loop condition.

In this work, the adaptive control problem is formulated as a continuous-control reinforcement learning task represented by a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. The observation vector is constructed directly from PID-related error signals,

$$s_t = \left[ e(t), \; \int e(t)\,dt, \; \frac{de(t)}{dt} \right], \tag{2}$$

so that the policy receives the proportional, integral, and derivative components of the tracking behavior at each time step. Unlike conventional gain-output formulations, the implemented TD3 agent directly generates a bounded scalar control action applied to the plant, while the actor architecture preserves a PID-like interpretation through learnable parameters associated with the error components. In this sense, the method combines the adaptability of DRL with the structural interpretability of PID control rather than replacing classical control concepts with a fully opaque black-box policy.

The proposed framework is developed and evaluated on a mass–spring–damper system operating under three challenging scenarios: external disturbance, parametric uncertainty, and their simultaneous presence. In addition, the learned controller is tested on both constant-reference regulation and sinusoidal reference tracking in order to assess its robustness across different control objectives. Figure 1 illustrates the overall concept of the proposed approach, in which a TD3-based agent learns a control policy from error-derived observations while the physical system is exposed to uncertainty and disturbance.

The remainder of this paper is organized as follows. Section 2 presents the system model and problem formulation. Section 3 describes the reinforcement learning framework and the TD3-based adaptive control architecture. Section 4 details the experimental setup and simulation conditions. Section 5 reports and discusses the results. Finally, Section 6 concludes the paper and outlines possible future directions.

Motivation, Contributions, and Novelty

Motivated by the need for adaptive and robust control under realistic operating conditions, this paper investigates reinforcement-learning-based PID tuning for a mass–spring–damper system using the TD3 algorithm in the presence of both internal uncertainty and external disturbances. Rather than replacing the classical PID controller, the proposed method preserves its simple and interpretable structure while introducing online gain adaptation through deep reinforcement learning.

The main contributions of this paper are summarized as follows:

A TD3-based reinforcement learning framework is developed for adaptive PID tuning in a continuous position-control problem, leveraging the stability advantages of TD3 in continuous action spaces.
The controller is evaluated under three challenging scenarios: external disturbances only, parametric uncertainty only, and the simultaneous presence of both.
The learned controller is analyzed for both constant-reference regulation and time-varying reference tracking.
The proposed framework preserves the classical PID structure while enhancing its adaptability through deep reinforcement learning, providing a practical hybrid control strategy for uncertain environments.

Furthermore, in contrast to existing reinforcement-learning-based PID tuning approaches, the proposed framework does not explicitly output PID gains as independent action variables. Instead, the control signal is generated as a bounded continuous action, while the PID-like structure emerges implicitly through the actor network operating on proportional, integral, and derivative error components. This design provides a structurally interpretable learning-based controller, in contrast to many previous approaches that rely on black-box policies or explicit gain-output formulations with limited interpretability.

From a technical perspective, a key challenge addressed in this work is the simultaneous presence of parametric uncertainty and external disturbances. Unlike many prior studies that consider only nominal conditions or a single source of uncertainty, the proposed method explicitly trains and evaluates the controller under multiple adverse scenarios, including disturbance-only, uncertainty-only, and their combined effect. This significantly increases the complexity of the learning problem and requires the agent to generalize across multiple sources of variability.

Furthermore, the proposed formulation integrates a PID-informed state representation with a continuous control action space, enabling a natural combination of classical control principles and reinforcement learning. In addition, the controller is evaluated under both constant-reference regulation and time-varying reference tracking, providing a more comprehensive assessment of robustness compared to studies focusing on a single control objective.

Overall, the main motivation of this work is to bridge the gap between classical PID interpretability and deep reinforcement learning adaptability by preserving the PID structure while enabling online learning-based adaptation under realistic and challenging operating conditions.


2. Background and Methodology

This section describes the physical system model and the PID control structure used in this study; the reinforcement learning formulation and the TD3-based adaptive control architecture are presented in Section 3. The objective is to enable an RL agent to learn the PID controller gains online in the presence of disturbances and parameter uncertainty. The overall workflow of the proposed framework is illustrated in Figure 1. The main symbols used throughout the paper are summarized in Table 1.

2.1. Mass–Spring–Damper System Model

The mass–spring–damper system is a standard benchmark in control engineering because it captures representative second-order dynamics while remaining analytically tractable for modeling, simulation, and controller evaluation [27]. The system consists of a mass $m$, spring constant $k$, and damping coefficient $c$, and its motion is governed by

$$m \ddot{x}(t) + c \dot{x}(t) + k x(t) = u(t) + d(t), \tag{3}$$

where $x(t)$ is the position of the mass, $u(t)$ is the control force generated by the controller, and $d(t)$ represents an external disturbance.

The corresponding state-space representation follows standard linear systems modeling practice and is widely used in controller analysis [27]. Defining the state variables $x_1 = x$ (position) and $x_2 = \dot{x}$ (velocity), the system can be written as

$$\dot{x}_1 = x_2, \tag{4}$$

$$\dot{x}_2 = -\frac{k}{m} x_1 - \frac{c}{m} x_2 + \frac{1}{m} u + \frac{1}{m} d. \tag{5}$$

Parametric uncertainty is introduced by varying the mass parameter m randomly within a predefined range, while disturbance forces are applied as time-varying inputs.

For completeness, the system can be expressed in the standard compact state-space form as

$$\dot{X}(t) = A X(t) + B u(t) + B d(t), \tag{6}$$

$$Y(t) = C X(t), \tag{7}$$

where

$$X(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix}, \quad Y(t) = x_1(t), \tag{8}$$

and the system matrices are defined as

$$A = \begin{bmatrix} 0 & 1 \\ -\frac{k}{m} & -\frac{c}{m} \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ \frac{1}{m} \end{bmatrix}, \tag{9}$$

$$C = \begin{bmatrix} 1 & 0 \end{bmatrix}. \tag{10}$$
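The model in (3) to (10) can be sketched numerically as follows. This is only a minimal illustration and not the simulation model used in the paper: the forward-Euler integrator, the function names, the step size, and the uniform sampling range for the mass are all assumptions introduced here.

```python
import random


def msd_derivatives(x1, x2, u, d, m, c, k):
    """Right-hand side of the state equations (4)-(5)."""
    dx1 = x2
    dx2 = -(k / m) * x1 - (c / m) * x2 + u / m + d / m
    return dx1, dx2


def simulate(u_fn, d_fn, m, c, k, x0=(0.0, 0.0), dt=1e-3, t_end=5.0):
    """Integrate the plant with forward Euler (an assumed, simple scheme).

    u_fn and d_fn are callables of time returning the control force and
    the disturbance. Returns a list of (t, x1, x2) samples.
    """
    x1, x2 = x0
    n_steps = int(round(t_end / dt))
    trajectory = [(0.0, x1, x2)]
    for i in range(n_steps):
        t = i * dt
        dx1, dx2 = msd_derivatives(x1, x2, u_fn(t), d_fn(t), m, c, k)
        x1 += dt * dx1
        x2 += dt * dx2
        trajectory.append((t + dt, x1, x2))
    return trajectory


def sample_mass(m_nominal, rel_range, rng):
    """Draw a mass uniformly from m_nominal * [1 - rel_range, 1 + rel_range].

    The uniform distribution and the relative range are hypothetical; the
    text only states that m varies randomly within a predefined range.
    """
    return m_nominal * (1.0 + rng.uniform(-rel_range, rel_range))
```

For a constant force and no disturbance, the simulated position settles at $u/k$, which is a quick sanity check on the signs in (5).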

2.2. PID Control Structure

The PID controller remains one of the most widely adopted feedback control structures because of its simple architecture, intuitive interpretation, and broad practical applicability across engineering domains [28,29]. The control objective is to make the system output follow a reference signal $r(t)$ by minimizing the tracking error

$$e(t) = r(t) - x(t). \tag{11}$$

The classical PID controller generates the control signal

$$u(t) = K_p e(t) + K_i \int e(t)\,dt + K_d \frac{de(t)}{dt}, \tag{12}$$

where $K_p$, $K_i$, and $K_d$ are the proportional, integral, and derivative gains.

However, the performance of a fixed-gain PID controller depends strongly on how its parameters are tuned, and satisfactory nominal tuning does not necessarily translate into robust behavior under uncertainty, disturbance, or time-varying operating conditions [28,29]. In this study, these gains are not fixed but are updated online by a reinforcement learning agent, through the learnable actor parameters described in Section 3.
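A discrete-time version of the PID law in (12) can be sketched as below. The rectangular integration, the backward-difference derivative, and the class and method names are assumptions chosen for illustration; the paper does not state how the integral and derivative are discretized.

```python
class DiscretePID:
    """Fixed-gain PID law (12) evaluated on sampled signals."""

    def __init__(self, kp, ki, kd, dt):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.dt = dt
        self.integral = 0.0
        self.prev_error = None  # no derivative estimate on the first sample

    def update(self, reference, measurement):
        """Return the control signal for one sampling instant."""
        error = reference - measurement
        # Rectangular (Euler) approximation of the integral term.
        self.integral += error * self.dt
        # Backward difference for the derivative term.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

A call such as `DiscretePID(2.0, 1.0, 0.5, 0.1).update(1.0, 0.0)` evaluates the three terms for a unit error at the first sample; the gain values here are arbitrary.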

3. Reinforcement Learning Formulation

The adaptive control problem is formulated as a continuous-state, continuous-action Markov decision process (MDP), which is the standard mathematical framework used to describe sequential decision-making problems in reinforcement learning [30]:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \tag{13}$$

where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $P$ is the transition dynamics, $R$ is the reward function, and $\gamma$ is the discount factor.

Within this formulation, the agent interacts with the environment by observing a state, selecting an action according to its policy, receiving a scalar reward, and transitioning to the next state, with the objective of maximizing the expected discounted return [30].
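The expected discounted return that the agent maximizes is built from per-episode sums of the form $G_t = \sum_{k \ge 0} \gamma^{k} r_{t+k}$. The short sketch below computes this sum for a finite reward sequence; the function name is hypothetical and the example is generic rather than specific to the paper's implementation.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] using a backward recursion."""
    g = 0.0
    for r in reversed(rewards):
        # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g
```

With a discount factor below one, rewards far in the future contribute less, so the same per-step penalty matters more early in an episode.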

The observation space is defined as a three-dimensional real-valued vector,

$$s_t = \begin{bmatrix} e(t) \\ \int e(t)\,dt \\ \frac{de(t)}{dt} \end{bmatrix} \in \mathbb{R}^{3}, \tag{14}$$

where $e(t)$ is the tracking error, $\int e(t)\,dt$ is the accumulated error, and $\frac{de(t)}{dt}$ is the error derivative. These three quantities correspond to the proportional, integral, and derivative components used in classical PID control and are provided to the agent as the observation vector at each interaction step. Using these observation components allows the learning agent to operate on PID-related information while remaining within the standard RL state–action interaction framework [30].

Unlike the abstract PID-gain formulation sometimes used for conceptual explanation, the actual implementation defines the action space as a scalar continuous control input,

$$a_t = u(t) \in [-u_{\max}, +u_{\max}], \tag{15}$$

where the lower and upper bounds are imposed directly through the action specification of the reinforcement learning environment. Thus, the agent outputs the control force applied to the mass–spring–damper system rather than explicitly outputting a three-dimensional gain vector. Bounding the action space can help restrict the search domain and improve practical training behavior, but such constraints alone do not constitute a formal stability guarantee for the learned controller [31,32].

Accordingly, the actor network implements a deterministic policy of the form

$$u(t) = \pi_\theta(s_t), \tag{16}$$

where $\pi_\theta$ is parameterized by the learnable actor weights $\theta$. The actor network consists of a featureInputLayer followed by a custom fullyConnectedPILayer. This custom layer computes a linear mapping using the absolute values of its learnable parameters,

$$u(t) = \sum_{i=1}^{3} |\theta_i| \, s_{t,i}, \tag{17}$$

where

$$s_t = \begin{bmatrix} s_{t,1} & s_{t,2} & s_{t,3} \end{bmatrix}^{T} = \begin{bmatrix} e(t) & \int e(t)\,dt & \frac{de(t)}{dt} \end{bmatrix}^{T}. \tag{18}$$

Because the observation vector is composed of PID-related signals, this actor structure behaves as a PID-like adaptive control law whose effective coefficients are learned directly from data. The use of the absolute-value operator enforces non-negative effective weights in the implemented layer.

The actor is initialized with

$$\theta^{(0)} = \begin{bmatrix} 10^{-3} & 2 & 1 \end{bmatrix}, \tag{19}$$

which provides a meaningful starting point for the learning process. After training, the learnable actor parameters are retrieved and interpreted as PID-related coefficients according to the implementation:

$$K_p = |\theta_1|, \quad K_i = |\theta_2|, \quad K_d = |\theta_3|. \tag{20}$$

Hence, the PID interpretation is not imposed by directly defining the action as a gain vector, but rather emerges from the structure of the actor network and the post-training extraction of its parameters.
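The absolute-value mapping in (17) and the gain extraction in (20) can be illustrated with a small stand-in class. This is a sketch written for this explanation and not the authors' fullyConnectedPILayer: the class name, method names, and dictionary keys are hypothetical, and only the initial parameter vector is taken from (19).

```python
class PIDLikeActor:
    """Linear actor whose effective weights are |theta_i|, as in (17)."""

    def __init__(self, theta=(1e-3, 2.0, 1.0)):
        # Default initialization follows theta^(0) in (19).
        self.theta = list(theta)

    def act(self, observation):
        """Map [e, integral of e, de/dt] to a scalar control signal."""
        return sum(abs(w) * s for w, s in zip(self.theta, observation))

    def effective_gains(self):
        """Read the learned parameters back as PID-like gains, as in (20)."""
        kp, ki, kd = (abs(w) for w in self.theta)
        return {"Kp": kp, "Ki": ki, "Kd": kd}
```

Because only the magnitudes enter the output, a parameter that becomes negative during training still yields a non-negative effective gain.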

The TD3 agent further employs two critic networks,

$$Q_{\phi_1}(s_t, a_t), \quad Q_{\phi_2}(s_t, a_t), \tag{21}$$

each of which receives the observation and action as inputs and estimates the expected return associated with the corresponding state–action pair. In the implementation, the critics use separate state and action input branches, followed by concatenation and fully connected layers, in order to approximate the action-value function. This structure is consistent with the continuous-control formulation adopted by TD3.

The reward function is computed by the Simulink environment and returned to the learning agent together with the next observation. Therefore, from the reinforcement learning viewpoint, the interaction at each time step is governed by

$$(s_t, a_t) \longrightarrow (r_t, s_{t+1}), \tag{22}$$

where the agent observes the current error-based state, applies a bounded scalar force, and receives both the next state and the scalar reward from the environment.
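The transition in (22) can be made concrete with a toy environment. The sketch below is not the Simulink environment used in the paper: the class name, the constructor arguments, the Euler integration, the finite-difference error derivative, and the choice to evaluate the reward on the post-step error and the clipped action are all assumptions made for illustration. The reward is passed in as a callable so that no particular reward definition is presumed.

```python
class MassSpringDamperEnv:
    """Toy environment with an error-based observation and a bounded force."""

    def __init__(self, m, c, k, reference_fn, reward_fn, u_max, dt=0.01):
        self.m, self.c, self.k = m, c, k
        self.reference_fn = reference_fn  # callable: time -> reference r(t)
        self.reward_fn = reward_fn        # callable: (error, control) -> reward
        self.u_max = u_max
        self.dt = dt
        self.reset()

    def reset(self):
        """Start at rest and return the initial observation."""
        self.t = 0.0
        self.x1 = 0.0
        self.x2 = 0.0
        self.integral = 0.0
        self.prev_error = self.reference_fn(0.0) - self.x1
        return [self.prev_error, 0.0, 0.0]

    def step(self, action, disturbance=0.0):
        """Apply one action and return (reward, next_observation)."""
        # Enforce the action bound [-u_max, +u_max].
        u = max(-self.u_max, min(self.u_max, action))
        accel = (-self.k * self.x1 - self.c * self.x2 + u + disturbance) / self.m
        self.x1 += self.dt * self.x2
        self.x2 += self.dt * accel
        self.t += self.dt
        error = self.reference_fn(self.t) - self.x1
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.reward_fn(error, u), [error, self.integral, derivative]
```

A rollout then alternates between a policy call on the observation and `env.step(action)`, which mirrors the observe, act, and receive cycle described above.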

With this formulation, the reinforcement learning problem can be summarized as follows: the TD3 agent learns a deterministic policy that maps the PID-related observation vector in (14) to a scalar control force in (15), while the learned actor parameters retain a PID-interpretable structure through the custom fullyConnectedPILayer. This formulation enables online adaptive control while preserving the interpretability of classical PID behavior.

3.1. Reward Function

The reward is designed to promote accurate reference tracking while discouraging unnecessarily large control actions. Penalizing both tracking error and control effort is consistent with standard control design practice, in which accuracy and actuation cost are balanced within a single performance objective [27].

Accordingly, the instantaneous reward at time step $t$ is defined as

$$r_t = -\left( \alpha \, e(t)^{2} + \beta \, u(t)^{2} \right), \tag{23}$$

where $e(t)$ denotes the tracking error, $u(t)$ is the control input, and $\alpha$ and $\beta$ are non-negative weighting coefficients.

In this study, the weighting parameters are selected as $\alpha = 100$ and $\beta = 10$ based on empirical evaluation. From a reinforcement-learning perspective, the specific reward shaping strongly influences the learned policy, convergence behavior, and the trade-off between transient performance and control smoothness [30,33].

The parameter $\alpha$ determines the emphasis on minimizing tracking error, while $\beta$ penalizes excessive control effort and promotes smoother control signals. Larger values of $\alpha$ lead to more aggressive error minimization, whereas higher values of $\beta$ result in more conservative control actions. The selected values were determined through preliminary experiments to achieve a balance between fast convergence, accurate tracking, and stable control behavior.

A systematic ablation study over a wider range of parameter values is considered as an important direction for future work.

The chosen values prioritize accurate tracking while maintaining moderate control effort, thereby enabling stable learning and robust closed-loop performance under uncertainty and disturbance.

The first term penalizes deviations from the reference signal, thereby encouraging the agent to minimize tracking error. The second term penalizes excessive control effort, which helps avoid unnecessarily aggressive actions and promotes smoother closed-loop behavior.

The weighting parameters $\alpha$ and $\beta$ are selected empirically to balance tracking accuracy and control smoothness, so that neither objective dominates the learning process; reward design in RL-based control is task dependent and can materially affect the resulting policy behavior [33]. As a result, the reward formulation guides the agent toward control policies that achieve three desirable properties simultaneously: small tracking error, moderate control effort, and stable transient response.

This quadratic penalty structure is also consistent with standard control design principles, since it encourages the learned policy to regulate the system efficiently while maintaining robustness under uncertainty and disturbance.
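The reward in (23) is simple enough to write out directly. The sketch below uses the weights reported in the text as defaults; the function name and argument names are hypothetical.

```python
def quadratic_reward(error, control, alpha=100.0, beta=10.0):
    """Instantaneous reward (23): negative weighted sum of squared terms."""
    return -(alpha * error ** 2 + beta * control ** 2)
```

With the default weights, a tracking error of 0.1 and a control force of 0.5 give a reward of about -3.5, of which 1.0 comes from the error term and 2.5 from the control term. This shows that the control penalty can dominate once the error is small, which is the trade-off the weights are meant to balance.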

3.2. TD3-Based Adaptive Control Architecture

Among deep reinforcement learning methods for continuous control, actor–critic algorithms such as DDPG, TD3, and SAC have been widely adopted because they can learn real-valued control policies directly from interaction data [14,15,16]. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is employed as the learning engine of the proposed adaptive control framework. The overall TD3-based adaptive control architecture is shown in Figure 2.

TD3 is an actor–critic reinforcement learning algorithm designed for continuous-action problems and is particularly suitable for control applications in which the policy must generate smooth real-valued commands. Compared with the Deep Deterministic Policy Gradient (DDPG) method, TD3 improves learning stability through three main mechanisms: (i) twin critic networks, (ii) delayed policy updates, and (iii) target policy smoothing. These mechanisms reduce overestimation bias, stabilize critic learning, and improve convergence in continuous-control environments.

In this work, TD3 is adopted because it extends deterministic actor–critic learning with mechanisms that mitigate value overestimation and improve training stability in continuous-control settings [15]. Compared with other actor–critic methods such as DDPG and SAC, TD3 provides a more suitable framework for the proposed adaptive control problem. While DDPG is known to suffer from overestimation bias and training instability, TD3 mitigates these issues through twin critics, delayed policy updates, and target policy smoothing.

In contrast, SAC employs stochastic policies with entropy regularization, which improves exploration but may introduce variability in the control signal. For control applications based on PID principles, deterministic and smooth control actions are generally preferred.

Therefore, TD3 is selected in this study as it provides a stable deterministic policy formulation that aligns well with classical control requirements while maintaining robustness under uncertainty and disturbance. To enhance the interpretability of the proposed controller, we provide a formal analysis of the relationship between the actor parameters and the classical PID gains.

Starting from the actor mapping in (17) and the state definition in (14), the control action can be rewritten explicitly as

$$u(t) = |\theta_1| \, e(t) + |\theta_2| \int e(t)\,dt + |\theta_3| \, \frac{de(t)}{dt}. \tag{24}$$

Comparing this expression with the classical PID control law in (12) establishes a direct one-to-one correspondence between the actor parameters and the effective PID gains, restated from (20) as

$$K_p = |\theta_1|, \quad K_i = |\theta_2|, \quad K_d = |\theta_3|.$$

This derivation, which leads from (17) to (24) and recovers the correspondence in (20), shows that the proposed controller is mathematically equivalent to a PID structure, where the gains are learned adaptively through reinforcement learning rather than being manually tuned.

Furthermore, the use of absolute value parametrization ensures non-negative effective gains, which is consistent with standard PID design practices and contributes to stable and interpretable control behavior.

3.3. Actor Network

The actor network represents the deterministic policy

$$a_t = \pi_\theta(s_t), \tag{25}$$

where $a_t$ is the action generated by the policy for the current state $s_t$. In the actual implementation, the action is defined as a scalar control force bounded in a continuous range, not as an explicit three-dimensional gain vector. Thus,

$$a_t = u(t) \in [u_{\min}, u_{\max}], \tag{26}$$

where the bounds are imposed through the action specification of the reinforcement learning environment.

The actor is implemented using a featureInputLayer followed by a custom fullyConnectedPILayer. This custom layer computes a weighted linear combination of the observation vector using the absolute values of its learnable parameters. Therefore, although the actor outputs a scalar control signal, the internal structure of the actor preserves a PID-like interpretation. Specifically, if the learnable parameter vector is denoted by

$$\theta = [\theta_1, \theta_2, \theta_3], \tag{27}$$

then the actor output is the expression already given in (24), restated here:

$$u(t) = |\theta_1| \, e(t) + |\theta_2| \int e(t)\,dt + |\theta_3| \, \frac{de(t)}{dt}.$$

This design provides two advantages. First, it preserves the interpretability of a classical PID controller because the actor acts directly on the proportional, integral, and derivative error components. Second, it allows the policy parameters to be optimized using reinforcement learning without manually tuning the PID gains.

The actor is initialized with a meaningful starting parameter vector, restated from (19) as

$$\theta^{(0)} = [10^{-3}, 2, 1].$$

After training, the learned parameters are extracted from the actor and interpreted as effective PID-related coefficients. Hence, the controller remains explainable even though its parameters are optimized through a data-driven learning process. The choice of a shallow actor network is intentional and motivated by the structure of the control problem. Since the observation vector consists of PID-related error components and the control law is represented as a linear combination of these components, the underlying mapping is inherently low-dimensional. Therefore, a shallow architecture is sufficient to capture the required input–output relationship without introducing unnecessary complexity.

From a reinforcement learning perspective, the actor network serves as a function approximator for the control policy. In problems where the optimal policy can be effectively represented by a structured linear mapping, deeper architectures may increase model capacity but do not necessarily improve performance. Instead, they may lead to overfitting, slower convergence, and reduced training stability in continuous control tasks.

Moreover, deeper networks would introduce additional nonlinear transformations that obscure the direct relationship between network parameters and control behavior. In contrast, the proposed shallow structure preserves a transparent mapping between the learnable parameters and the effective PID gains, which is essential for interpretability.

Overall, the shallow actor network provides an appropriate balance between expressiveness, stability, computational efficiency, and interpretability for the considered control problem.

3.4. Critic Networks (Twin Q-Networks)

The use of two critic networks follows the original TD3 formulation, in which the minimum of the two target value estimates is used to reduce overestimation bias during critic learning [15].

A distinguishing feature of TD3 is the use of two independent critic networks, denoted by $Q_{ϕ_{1}} (s_{t}, a_{t})$ and $Q_{ϕ_{2}} (s_{t}, a_{t})$ as in (21), which estimate the action-value function for the same state–action pair. The two critics operate in parallel and do not depend on one another. During target computation, TD3 uses

min (Q_{ϕ_{1}} (s_{t}, a_{t}), Q_{ϕ_{2}} (s_{t}, a_{t}))

(28)

to reduce overestimation bias, which is a known weakness of deterministic actor–critic methods.

In the implemented architecture, each critic receives both the observation vector and the action as inputs. These two inputs are first processed through separate branches: a state branch and an action branch. The resulting features are then fused using a concatenation layer, followed by a common fully connected processing path. In this study, the shared hidden layers of each critic have dimensions

2048 \to 1024 \to 1,

(29)

with nonlinear activation between the fully connected layers. The final scalar output corresponds to the estimated Q-value, i.e., the expected discounted return associated with the current state and action.

This critic design is important because it allows the TD3 agent to evaluate how appropriate a selected control force is under the current tracking condition. The critic estimates are then used to guide actor updates during training.
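The two-branch critic described above can be sketched in Python with NumPy. This is a minimal illustration, not the implementation used in the study: the random initialization, the ReLU placed after each branch layer, and the helper names `init_critic` and `critic_forward` are assumptions introduced here. The default widths follow (29), while the demo uses reduced widths to stay lightweight.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_critic(rng, obs_dim=3, act_dim=1, branch=2048, hidden=(2048, 1024)):
    """Random weights for one critic; default widths follow (29). Init scheme is assumed."""
    def layer(n_in, n_out):
        w = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        return w, np.zeros(n_out)
    return {
        "state": layer(obs_dim, branch),      # state branch
        "action": layer(act_dim, branch),     # action branch
        "fc1": layer(2 * branch, hidden[0]),  # common path after concatenation
        "fc2": layer(hidden[0], hidden[1]),
        "out": layer(hidden[1], 1),           # scalar Q-value
    }

def critic_forward(p, s, a):
    """Q(s, a) for a batch: s has shape (B, obs_dim), a has shape (B, act_dim)."""
    hs = relu(s @ p["state"][0] + p["state"][1])
    ha = relu(a @ p["action"][0] + p["action"][1])
    h = np.concatenate([hs, ha], axis=1)
    h = relu(h @ p["fc1"][0] + p["fc1"][1])
    h = relu(h @ p["fc2"][0] + p["fc2"][1])
    return (h @ p["out"][0] + p["out"][1])[:, 0]

rng = np.random.default_rng(0)
# Reduced widths keep the demo light; use the defaults to match (29).
critics = [init_critic(rng, branch=32, hidden=(32, 16)) for _ in range(2)]
s = rng.normal(size=(4, 3))                  # hypothetical observations
a = rng.uniform(-30.0, 30.0, size=(4, 1))    # hypothetical control forces
q1, q2 = (critic_forward(p, s, a) for p in critics)
```

Because the two critics are initialized independently, they generally return different estimates for the same batch, which is what makes taking their minimum informative.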

3.5. Target Networks

To improve convergence stability, TD3 employs target networks for both the actor and the critics. Let $π_{θ^{'}}$ denote the target actor and $Q_{ϕ_{1}^{'}}$, $Q_{ϕ_{2}^{'}}$ denote the target critics. These networks are not updated abruptly. Instead, they are updated gradually through soft updates of the form

θ^{'} \leftarrow τ θ + (1 - τ) θ^{'},

(30)

ϕ_{i}^{'} \leftarrow τ ϕ_{i} + (1 - τ) ϕ_{i}^{'}, i \in {1, 2},

(31)

where $τ$ is a small positive constant. This mechanism prevents rapid oscillations in the target estimates and improves stability during learning.

In addition to target networks, TD3 employs a delayed policy update mechanism to improve training stability. Specifically, the actor network is not updated at every time step but once every $d$ iterations, where $d$ denotes the policy delay parameter. Accordingly, the actor parameters are updated only when

t mod d = 0,

(32)

while the critic networks are updated at every step.

This delayed update strategy reduces variance in policy updates and mitigates the propagation of critic estimation errors to the actor, thereby improving overall learning stability.
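The soft updates in (30) and (31) can be illustrated with a short Python sketch. The dictionary-of-arrays parameter layout and the value of `tau` are assumptions made for illustration; the smoothing factor used in the study is not stated in this section.

```python
import numpy as np

def soft_update(target, online, tau):
    """Return tau * online + (1 - tau) * target for every parameter array."""
    return {k: tau * online[k] + (1.0 - tau) * target[k] for k in target}

online = {"w": np.array([1.0, 2.0, 3.0])}
target = {"w": np.zeros(3)}
tau = 0.005  # hypothetical smoothing factor
for _ in range(3):
    target = soft_update(target, online, tau)
```

Starting from a zero target, n soft updates toward fixed online parameters give a target equal to (1 - (1 - τ)^n) times the online values, so a small τ makes the target lag well behind the online network.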

3.6. Target Policy Smoothing

In addition, TD3 applies target policy smoothing by perturbing the target action with clipped noise, thereby reducing the tendency of the policy to exploit narrow value peaks caused by function approximation error [15].

During critic updates, the target action is perturbed with Gaussian noise:

{\tilde{a}}_{t + 1} = π_{θ^{'}} (s_{t + 1}) + ϵ, ϵ \sim N (0, σ^{2}) .

(33)

This prevents the critic from exploiting unrealistically sharp peaks in the approximated Q-function and encourages smoother control policies. In practice, this regularization is especially useful in continuous control tasks where aggressive action changes may lead to unstable or highly oscillatory behavior.

3.7. Delayed Policy Updates

Experience replay is used to decorrelate consecutive samples and improve data efficiency, which is standard practice in off-policy deep actor–critic methods for continuous control [14,15].

Unlike standard actor–critic methods in which actor and critic updates are performed at the same frequency, TD3 updates the actor more slowly than the critics. The critics are updated at every training step, whereas the actor parameters are updated only once every $d$ critic updates, where $d$ denotes the policy delay factor (set to $d = 2$ in this study). This design ensures that the actor is improved using sufficiently accurate Q-value estimates rather than rapidly fluctuating critic outputs. As a result, the policy evolves more smoothly, and the learning process becomes more stable. This mechanism is particularly beneficial in the proposed adaptive control setting because abrupt policy changes may translate into undesirable variations in the effective PID-related parameters.
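As a small illustration of the delay rule in (32), the following Python sketch lists the time steps at which the actor would be updated. The helper name `actor_update_steps` is hypothetical, and the sketch assumes that the step counter starts at zero.

```python
def actor_update_steps(n_steps, d=2):
    """Time steps t in [0, n_steps) that satisfy t mod d == 0."""
    return [t for t in range(n_steps) if t % d == 0]

first_steps = actor_update_steps(10, d=2)        # steps with an actor update
per_episode = len(actor_update_steps(100, d=2))  # actor updates in a 100-step episode
```

With $d = 2$ and 100 steps per episode, the critics receive 100 updates while the actor receives 50.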

4. Training Algorithm

4.1. Training Algorithm and Hyperparameter Configuration

The overall training procedure of the proposed adaptive PID control framework is summarized in Algorithm 1. The environment is defined using the observation vector in (14) and a one-dimensional continuous action space following (15), with $u_{max} = 30$ in the reported experiments. Thus, in the actual implementation, the agent directly outputs the control force, whereas the effective PID parameters are represented implicitly by the learnable weights of the actor network.

The interaction between the agent and the plant is performed using a fixed sampling time of

T_{s} = 0.1 s,

(34)

with a total episode duration of

T_{f} = 10 s .

(35)

Accordingly, the number of time steps per episode is

N_{step} = ⌈\frac{T_{f}}{T_{s}}⌉ = 100 .

(36)

This sampling interval ensures synchronized data exchange between the TD3 agent and the simulated plant while providing sufficient temporal resolution to capture the mass–spring–damper dynamics.
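To make the role of the sampling time concrete, the sketch below builds PID-related observations at $T_{s} = 0.1$ s over one 100-step episode. The ordering [error, integral, derivative], the rectangular-rule integral, and the backward-difference derivative are assumptions made for illustration; the exact definition in (14) may differ.

```python
import numpy as np

def pid_observations(reference, position, ts=0.1):
    """Rows of [e, running integral of e, backward-difference derivative of e]."""
    e = np.asarray(reference, dtype=float) - np.asarray(position, dtype=float)
    integral = np.cumsum(e) * ts                 # rectangular-rule running sum
    derivative = np.diff(e, prepend=e[0]) / ts   # defined as zero at the first sample
    return np.stack([e, integral, derivative], axis=1)

n_step = 100                       # T_f / T_s = 10 / 0.1, as in (36)
t = np.arange(n_step) * 0.1
position = 1.0 - np.exp(-t)        # hypothetical response to a unit step reference
obs = pid_observations(np.ones(n_step), position)
```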

Algorithm 1 TD3-Based Adaptive PID Control.

1: Define observation specification $s_{t} \in \mathbb{R}^{3}$ and scalar action specification $a_{t} \in [-30, 30]$
2: Construct Simulink environment and set reset function
3: Initialize actor network $π_{θ}$ using featureInputLayer and fullyConnectedPILayer
4: Initialize twin critic networks $Q_{ϕ_{1}}, Q_{ϕ_{2}}$
5: Initialize target networks $π_{θ^{'}}$, $Q_{ϕ_{1}^{'}}$, and $Q_{ϕ_{2}^{'}}$
6: Initialize replay buffer $D$
7: Set policy delay parameter $d = 2$
8: for each episode do
9:     Reset environment and obtain initial state $s_{0}$
10:     for each time step $t = 0, \dots, N_{step} - 1$ do
11:         Observe state $s_{t}$ as defined in (14)
12:         Select action $a_{t} = π_{θ} (s_{t}) + ϵ$, $ϵ \sim N (0, σ^{2})$
13:         Clip $a_{t}$ within the force bounds $[-30, 30]$
14:         Apply control force $u (t) = a_{t}$ to the plant
15:         Observe reward $r_{t}$ and next state $s_{t + 1}$
16:         Store transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in $D$
17:         Sample a mini-batch of 128 transitions from $D$
18:         Compute smoothed target action ${\tilde{a}}_{t + 1} = π_{θ^{'}} (s_{t + 1}) + ϵ^{'}$
19:         Compute target Q-value $y_{t} = r_{t} + γ \min_{i \in \{1, 2\}} Q_{ϕ_{i}^{'}} (s_{t + 1}, {\tilde{a}}_{t + 1})$
20:         Update critic networks by minimizing $L_{critic} = \sum_{i = 1}^{2} {(Q_{ϕ_{i}} (s_{t}, a_{t}) - y_{t})}^{2}$
21:         if $t \bmod d = 0$ then
22:             Update actor parameters using the deterministic policy gradient $\nabla_{θ} J \approx \nabla_{a} Q_{ϕ_{1}} (s, a) |_{a = π_{θ} (s)} \nabla_{θ} π_{θ} (s)$
23:             Soft-update target networks using (30) and (31)
24:         end if
25:     end for
26: end for
27: Simulate the trained agent
28: Extract learnable actor parameters and compute final gains using (20)
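Steps 19 and 20 of Algorithm 1 can be illustrated numerically with the Python sketch below. The sample values and the discount factor `gamma` are hypothetical, the loss is averaged over the mini-batch as an assumption, and terminal-state handling is omitted, as in Algorithm 1.

```python
import numpy as np

def td3_target(r, q1_next, q2_next, gamma):
    """Step 19: y = r + gamma * min(Q1', Q2') at the next state and smoothed action."""
    return r + gamma * np.minimum(q1_next, q2_next)

def critic_loss(q1, q2, y):
    """Step 20, averaged over the mini-batch: sum over both critics of mean((Q_i - y)^2)."""
    return float(np.mean((q1 - y) ** 2) + np.mean((q2 - y) ** 2))

r = np.array([-1.0, -0.5])                       # hypothetical rewards
y = td3_target(r, np.array([-4.0, -2.0]), np.array([-3.0, -2.5]), gamma=0.99)
loss = critic_loss(np.array([-5.0, -3.0]), np.array([-4.5, -2.5]), y)
```

In each sample the smaller of the two target-critic values enters the target, which is the mechanism that counteracts overestimation.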

4.1.1. Actor Network Implementation

The actor network is implemented as a deterministic continuous actor using a custom shallow neural structure, $u (t) = f_{θ} (s_{t})$, where $f_{θ} (\cdot)$ consists of a featureInputLayer followed by a custom fullyConnectedPILayer. The actor is initialized according to (19), which provides meaningful initial PID-related weights. The custom layer computes the PID-like mapping in (17), where the absolute-value operation enforces non-negative effective gains. Since the observation vector is composed of PID-related signals, the actor behaves as a learnable PID-like control law. After training, the gain parameters are extracted from the actor weights using (20). This design preserves the interpretability of PID control while embedding the gain adaptation mechanism inside the actor network.
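A minimal Python sketch of such a PID-like actor is given below, under stated assumptions. It assumes that the mapping in (17) is a weighted sum of the observation components with absolute-valued weights and that the effective gains in (20) are those absolute values; the exact forms of (17) and (20), the pairing of weights with observation components, and the saturation inside the function are not confirmed by this section. The function name `pid_like_actor` is hypothetical.

```python
import numpy as np

def pid_like_actor(theta, s, u_max=30.0):
    """u = sum_i |theta_i| * s_i, saturated to [-u_max, u_max] (assumed form)."""
    gains = np.abs(np.asarray(theta, dtype=float))   # non-negative effective gains
    u = float(gains @ np.asarray(s, dtype=float))
    return float(np.clip(u, -u_max, u_max)), gains

theta0 = [1e-3, 2.0, 1.0]     # initial parameter vector, as restated from (19)
s = [0.5, 0.1, -0.2]          # hypothetical observation; component ordering is assumed
u, gains = pid_like_actor(theta0, s)
```

Because only the absolute values enter the control law, a negative learned weight yields the same control force as its positive counterpart, so the extracted gains can always be read as non-negative.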

4.1.2. Critic Networks

The TD3 agent uses two critic networks in parallel. Each critic is implemented as a Q-value function with separate observation and action paths, which are later concatenated. The state path begins with a featureInputLayer followed by a fully connected layer of size 2048, while the action path consists of a featureInputLayer followed by another fully connected layer of size 2048. The two branches are combined using a concatenationLayer, followed by a common path with hidden dimensions $2048 \to 1024 \to 1$, with ReLU activation between the fully connected layers. The output of each critic is a scalar estimate of the expected return,

Q_{ϕ_{i}} (s_{t}, a_{t}), i \in {1, 2} .

(37)

4.1.3. Replay Buffer and Mini-Batch Sampling

The TD3 agent is trained using an experience replay buffer

D = {(s_{t}, a_{t}, r_{t}, s_{t + 1})},

(38)

with an experience buffer length of

| D | = 10^{8} .

(39)

This large replay memory allows the agent to reuse a diverse set of transitions, which is important for stable off-policy learning. At each update step, a mini-batch of

N_{batch} = 128

(40)

samples is drawn uniformly from the replay buffer. This choice provides a practical balance between computational cost and statistical diversity.
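A replay buffer with uniform mini-batch sampling can be sketched in Python as follows. The class `ReplayBuffer` and its method names are hypothetical and do not correspond to the toolbox implementation used in the study; the capacity and batch size follow (39) and (40).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity=10**8, seed=0):
        self.data = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size=128):
        # Uniform sampling without replacement; requires len(self) >= batch_size.
        return self.rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

buf = ReplayBuffer()
for i in range(200):                         # hypothetical transitions
    buf.store(i, 0.0, -1.0, i + 1)
batch = buf.sample(128)
```

Copying the deque to a list on every call is acceptable for a sketch; a preallocated array with index sampling would be a more efficient design for long training runs.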

4.1.4. Optimization Settings

Separate optimizer options are used for the actor and critic networks. The actor optimizer is configured with

η_{actor} = 10^{- 3},

(41)

while each critic optimizer uses

η_{critic} = 10^{- 4} .

(42)

In both cases, a gradient threshold of

{∥ \nabla ∥}_{max} = 1

(43)

is applied in order to reduce the risk of unstable updates caused by large gradients. Although the learning rates $η_{actor}$ and $η_{critic}$ are not explicitly listed in Algorithm 1, they are used internally by the optimization routines of the reinforcement learning framework during training.
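The effect of the gradient threshold in (43) can be illustrated with the following Python sketch. It assumes that the threshold is applied to the L2 norm of the gradient and uses a plain gradient step for clarity; the optimizer type used in the study is not stated in this section, and the function names are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)

def gradient_step(params, grad, lr, threshold=1.0):
    """One plain gradient-descent step with norm clipping (assumed update rule)."""
    return params - lr * clip_by_norm(grad, threshold)

g = np.array([3.0, 4.0])                    # hypothetical gradient with norm 5
clipped = clip_by_norm(g)                   # rescaled to unit norm
new_actor = gradient_step(np.zeros(2), g, lr=1e-3)    # actor learning rate from (41)
new_critic = gradient_step(np.zeros(2), g, lr=1e-4)   # critic learning rate from (42)
```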

4.1.5. Implementation of Target Policy Smoothing

To further improve learning stability, TD3 applies target policy smoothing. In the implemented agent options, the standard deviation of the target smoothing model is set to

σ_{smooth} = \sqrt{0.2} .

(44)

Accordingly, the smoothed target action is computed as

{\tilde{a}}_{t + 1} = π_{θ^{'}} (s_{t + 1}) + ϵ, ϵ \sim N (0, σ_{smooth}^{2}) .

(45)

This mechanism prevents the critic from exploiting sharp peaks in the Q-function and promotes smoother control behavior.
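The smoothing step in (45) can be sketched in Python as shown below. The optional noise clipping and the final saturation of the smoothed action to the force bounds are common TD3 practice and are included here as assumptions; they are not confirmed by the configuration reported in this section.

```python
import numpy as np

def smoothed_target_action(a_target, rng, sigma=np.sqrt(0.2), u_max=30.0, noise_clip=None):
    """Add Gaussian noise to the target action, then keep it within the force bounds."""
    eps = rng.normal(0.0, sigma, size=np.shape(a_target))
    if noise_clip is not None:               # optional clipping of the noise (assumption)
        eps = np.clip(eps, -noise_clip, noise_clip)
    return np.clip(a_target + eps, -u_max, u_max)

rng = np.random.default_rng(0)
a_target = np.full(10000, 5.0)               # hypothetical target-actor outputs
a_tilde = smoothed_target_action(a_target, rng)
```

With $σ_{smooth} = \sqrt{0.2} \approx 0.447$, the perturbation is small relative to the force range of $\pm 30$, so the critic is evaluated in a narrow neighborhood of the target action.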

4.1.6. Training Duration and Stopping Criteria

The training process is carried out for a maximum number of episodes:

N_{ep} = 500

(46)

This stopping criterion was determined from the empirical convergence behavior observed during preliminary experiments. Specifically, the cumulative reward and tracking performance were monitored across multiple training runs, and both metrics reached a stable plateau by approximately 500 episodes, indicating convergence of the learned policy.

From a reinforcement learning perspective, such plateau behavior in the reward trajectory is commonly used as an indicator of convergence. Therefore, selecting 500 episodes provides a practical balance between convergence quality and computational efficiency. The training process is executed with

rng(0),

(47)

which ensures reproducibility of the reported runs.
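One simple way to check for a reward plateau is to compare the averages of two consecutive episode windows, as in the Python sketch below. The reward curve is synthetic, the window length and tolerance are arbitrary choices, and the function `has_plateaued` is hypothetical; this is not the criterion used by the authors. The fixed NumPy seed plays a role analogous to rng(0).

```python
import numpy as np

def has_plateaued(rewards, window=50, tol=0.02):
    """True if the means of the last two windows differ by at most tol (relative)."""
    r = np.asarray(rewards, dtype=float)
    if r.size < 2 * window:
        return False
    prev = r[-2 * window:-window].mean()
    last = r[-window:].mean()
    return bool(abs(last - prev) <= tol * max(abs(prev), 1e-12))

rng = np.random.default_rng(0)               # fixed seed for a reproducible demo
episodes = np.arange(500)
# Synthetic reward curve: exponential improvement plus noise (not data from the study).
rewards = -300.0 - 400.0 * np.exp(-episodes / 60.0) + rng.normal(0.0, 5.0, 500)
early = has_plateaued(rewards[:100])
late = has_plateaued(rewards)
```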

4.1.7. Simulation After Training

After training, the learned agent is evaluated using

N_{sim} = 100

(48)

simulation steps, corresponding again to the same 10-s horizon. The actor is then retrieved from the trained TD3 agent, and its learnable parameters are used to compute the final effective PID gains.

The complete training workflow is summarized in Algorithm 1.

4.2. Stability and Convergence Considerations

Although reinforcement learning can produce effective feedback policies, it does not in general provide closed-loop stability or safety guarantees unless additional control-theoretic constraints or certification mechanisms are incorporated into the learning framework [31,32].

The proposed control framework combines a classical PID controller with a reinforcement learning policy that updates the controller gains online. Since reinforcement learning algorithms do not inherently guarantee closed-loop stability, it is important to discuss the conditions under which stable behavior can be expected.

In the proposed formulation, the control input is generated by the PID structure in (12), where the effective gain vector

[K_{p}, K_{i}, K_{d}]

is produced by the policy learned using the TD3 algorithm. Because the PID structure is preserved, the controller maintains the fundamental properties of classical feedback control, and the reinforcement learning agent only adjusts the gain values within predefined ranges.

To limit unstable behavior during learning, the action space of the agent is bounded, so that the control force generated by the PID-like actor remains within $[-u_{max}, u_{max}]$. This constraint restricts the policy search space and reduces the likelihood of unstable closed-loop behavior during exploration. In addition, the reward function penalizes large tracking error and excessive control effort, which encourages smooth and stable responses and guides the learning process toward stable control policies.
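A reward that penalizes tracking error and control effort could take the quadratic form sketched below in Python. The quadratic form and the weight values are assumptions chosen only for illustration; the reward actually used in the study is defined elsewhere in the paper and may differ.

```python
def reward(e, u, alpha=1.0, beta=0.01):
    """Negative weighted sum of squared tracking error and squared control effort."""
    return -(alpha * e ** 2 + beta * u ** 2)

r_perfect = reward(0.0, 0.0)      # no error and no effort: reward of zero
r_small = reward(0.1, 1.0)        # small error with gentle actuation
r_large = reward(1.0, 30.0)       # large error with saturated actuation
```

Under this form the reward is never positive, so less negative episode returns correspond to better tracking with less actuation.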

The TD3 algorithm further improves stability compared to standard actor–critic methods by using two critic networks, delayed policy updates, and target policy smoothing. These mechanisms reduce the overestimation of action values and prevent abrupt changes in the policy, which contributes to more reliable learning in continuous control problems.

Although no formal Lyapunov stability proof is provided in this work, the simulation results show that the learned policy produces stable closed-loop behavior under external disturbances, parametric uncertainty, and their simultaneous presence. The convergence of the reward values during training and the absence of divergence in the system response indicate that the proposed reinforcement-learning-based tuning method can achieve stable control performance for the considered system. Accordingly, future work may consider integrating Lyapunov-based or barrier-function-based safe reinforcement learning mechanisms in order to obtain stronger theoretical guarantees during both training and deployment [32].

Providing theoretical stability guarantees for reinforcement-learning-based controllers remains an open research problem and will be investigated in future work.

5. Results

5.1. Simulation Environment

The adopted evaluation protocol follows common practice in RL-based continuous-control studies by assessing the learned policy under multiple operating conditions and comparing closed-loop behavior across representative reference and disturbance scenarios [15,24,25,26]. The system and controller are implemented in a Simulink-based environment where the RL agent interacts with the plant at each time step.

External disturbances and parameter variations are applied during training in order to create a realistic control scenario.

Three uncertainty conditions are considered:

External disturbance only
Parametric uncertainty only
Both disturbance and uncertainty

These scenarios allow evaluation of the robustness of the learned controller.

This section presents the simulation results obtained using the TD3-based adaptive PID control framework described in Section 2. The objective of the experiments is to evaluate the ability of the learned policy to maintain stable and accurate position tracking under different uncertainty conditions. During training, the reinforcement learning agent updates the PID gains online according to the state vector composed of the proportional, integral, and derivative components of the tracking error. The learned policy maximizes the cumulative reward, that is, it minimizes the penalized tracking error and control effort, while preserving closed-loop stability in the reported simulations. The steady-state error remains close to zero, and the transient response shows reduced overshoot and faster settling than the classical PID tuning used in preliminary trials.

All simulations are performed in a closed-loop environment where the mass–spring–damper system is controlled by a PID controller whose gains are generated by the actor network of the TD3 algorithm. The controller is evaluated under three different scenarios:

External disturbances;
Parametric uncertainty;
Simultaneous disturbance and uncertainty.

In addition, both constant reference regulation and sinusoidal reference tracking are considered.
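Tracking quality in such scenarios can be quantified with simple error metrics, as in the Python sketch below. These metrics (integral of absolute error, root-mean-square error, and final error) are not reported in the paper, and the response signal is synthetic; the sketch only indicates how the closed-loop trajectories could be compared numerically.

```python
import numpy as np

def tracking_metrics(reference, position, ts=0.1):
    """Integral of absolute error, RMS error, and absolute error at the final sample."""
    e = np.asarray(reference, dtype=float) - np.asarray(position, dtype=float)
    return {
        "iae": float(np.sum(np.abs(e)) * ts),
        "rmse": float(np.sqrt(np.mean(e ** 2))),
        "final_error": float(abs(e[-1])),
    }

t = np.arange(100) * 0.1                     # 100 samples at T_s = 0.1 s
position = 1.0 - np.exp(-t)                  # synthetic response to a unit reference
metrics = tracking_metrics(np.ones(100), position)
```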

5.2. PID Tuning Under External Disturbances

In the first experiment, the controller is trained in the presence of external disturbances while the system parameters are kept constant. The disturbance level is set to $10\%$, and the objective of the learning process is to obtain PID gains that maintain stability despite perturbations.

Figure 3 shows the training progress of the TD3 agent over 500 episodes. The figure presents the episode reward, the average reward, and the predicted Q-value. At the beginning of training, the reward values exhibit large fluctuations due to the exploration phase. After approximately 200 episodes, the reward curve becomes smoother, indicating that the agent has learned a stable policy. The improvement from an average reward of approximately $-710$ to a final reward close to $-304$ confirms that the learning process converges even in the presence of disturbances.

After training, the PID gains extracted from the actor network were

K_{p} = 16.3904, K_{d} = 12.5335, K_{i} = 21.0099 .

These values provide a good balance between fast response and damping, allowing the controller to suppress the disturbance while keeping the system stable.

Figure 4 shows the position tracking response under disturbance. The system initially starts away from the reference and exhibits a short transient oscillation, but the controller quickly stabilizes the motion and converges to the target position.

Figure 5 presents the sinusoidal reference tracking result under disturbance. The controller successfully follows the time-varying reference, showing that the learned policy can adapt to dynamic conditions.

5.3. PID Tuning Under Internal Disturbances

In the second experiment, only internal disturbances are considered. Parametric uncertainty is introduced by varying the mass value randomly within $\pm 10\%$ at each step, while no external disturbance is applied. This scenario evaluates the ability of the learned policy to generalize across different plant dynamics.
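The per-step mass variation can be sketched in Python as follows. The uniform distribution, the nominal plant parameters (m = 1, c = 1, k = 1), the constant test force, and the semi-implicit Euler integration are all assumptions made for illustration; the plant parameters and the sampling law used in the study are not given in this section.

```python
import numpy as np

def perturbed_mass(m_nominal, rng, rel=0.10):
    """Draw a mass uniformly within +/- rel of the nominal value (assumed distribution)."""
    return m_nominal * (1.0 + rng.uniform(-rel, rel))

def plant_step(x, v, u, m, c, k, ts=0.1, substeps=100):
    """Advance m*x'' + c*x' + k*x = u by ts seconds with u held constant."""
    h = ts / substeps
    for _ in range(substeps):
        v += h * (u - c * v - k * x) / m     # semi-implicit Euler: velocity first
        x += h * v                           # then position with the updated velocity
    return x, v

rng = np.random.default_rng(0)
x, v, masses = 0.0, 0.0, []
for _ in range(100):                         # one 10 s episode at T_s = 0.1 s
    m = perturbed_mass(1.0, rng)             # new mass at every control step
    masses.append(m)
    x, v = plant_step(x, v, u=1.0, m=m, c=1.0, k=1.0)
```

Under a constant force the equilibrium position u/k does not depend on the mass, so in this open-loop test the variation affects only the transient.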

The training process over 500 episodes is shown in Figure 6. Large reward fluctuations at the beginning indicate exploration, while the gradual increase in the average reward demonstrates policy improvement. After approximately 250 episodes, the reward stabilizes, showing that the agent has learned a reliable control strategy.

The optimized gains obtained after training are

K_{p} = 19.5526, K_{d} = 11.2889, K_{i} = 20.9967 .

Figure 7 shows constant reference tracking under mass variation. The system exhibits a transient overshoot but quickly settles to the reference, indicating that the learned gains remain effective despite parameter changes.

Figure 8 shows sinusoidal reference tracking under parametric uncertainty. The controller maintains accurate tracking with small phase lag, demonstrating that the learned policy can handle time-varying targets.

5.4. PID Tuning Under Both Internal and External Disturbances

In the final experiment, both external disturbance and parametric uncertainty are applied simultaneously. This represents the most challenging scenario and evaluates the robustness of the learned policy.

Figure 9 shows the training performance under combined disturbances. The initial episodes exhibit large reward variance due to the increased difficulty of the learning task. After approximately 300 episodes, the reward becomes stable, indicating convergence of the TD3 policy.

The learned gains are

K_{p} = 13.2692, K_{d} = 4.7384, K_{i} = 5.1016 .

These smaller gains produce a more damped response, which improves stability in the presence of disturbances. Figure 10 shows constant reference tracking under combined disturbances. The system converges to the reference quickly and remains stable despite uncertainty. Figure 11 shows sinusoidal reference tracking in the same conditions. The controller follows the reference accurately, confirming that the learned policy can adapt to both dynamic references and uncertain environments.

6. Limitations and Future Works

Although the proposed TD3-based adaptive PID control framework demonstrated stable and robust performance in the presented experiments, several limitations should be noted.

First, the evaluation was conducted entirely in a simulation environment. While the Simulink-based setup allows controlled testing under different disturbance and uncertainty conditions, real-world implementations may introduce additional challenges such as sensor noise, actuator saturation, time delays, and computational constraints. These factors may affect the performance of reinforcement-learning-based controllers and should be investigated in future experimental studies.

Second, the considered plant is a single-degree-of-freedom mass–spring–damper system. Although this model is widely used as a standard benchmark in control research, it does not fully represent the complexity of real industrial or robotic systems. More complex nonlinear systems, multi-degree-of-freedom dynamics, and coupled processes require further investigation to fully assess the scalability of the proposed framework.

Third, the reward function used in this study was selected empirically in order to penalize tracking error and excessive control effort. Different reward formulations may lead to different learning behavior, convergence speed, and final performance. Designing reward functions that guarantee both stability and optimality remains an open challenge in reinforcement learning-based control.

Another limitation is that the proposed reinforcement-learning-based controller does not provide formal closed-loop stability guarantees, and stability is assessed empirically through observed convergence and tracking performance. To reduce the risk of unstable behavior, the control action is constrained within predefined bounds, and the reward function penalizes excessive control effort, promoting smooth responses.

For real-world deployment, additional safety mechanisms can be incorporated, such as actuator saturation limits, gain constraints, and supervisory safety layers (e.g., fallback PID controllers or safety filters). These mechanisms can ensure that the control system operates within safe limits even under unexpected conditions or model uncertainties.

Finally, the training process requires a relatively large number of episodes in order to obtain a stable policy. This may limit the applicability of the method in situations where training time or computational resources are restricted.

Future work will include systematic validation on widely used nonlinear and multi-degree-of-freedom benchmark systems (e.g., inverted pendulum, cart-pole, and robotic manipulators) to further evaluate the scalability and generalization capability of the proposed framework.

In addition, a systematic ablation study over a wider range of reward parameters ($α$, $β$) is not included in this work due to the high computational cost associated with retraining the reinforcement learning agent. Such an analysis is considered an important direction for future research.

7. Stability Considerations and Safe Deployment

Although the proposed TD3-based adaptive control framework demonstrates strong empirical performance, it does not provide a formal stability guarantee, which is a common limitation of model-free reinforcement learning methods. The learned policy depends on data-driven optimization and may exhibit unpredictable behavior outside the training distribution.

To mitigate potential stability risks, several design choices are incorporated into the proposed framework. First, the control input is explicitly bounded within a predefined range, which prevents excessively large actuation signals and reduces the risk of instability. Second, the reward function penalizes both tracking error and control effort, thereby discouraging aggressive control actions and promoting smoother system response. Third, the actor network preserves a PID-like structure with non-negative effective gains, which enhances interpretability and aligns with classical control principles.

For real-world deployment, additional safety mechanisms can be integrated into the control loop. These include actuator saturation limits, gain constraints, and supervisory safety layers such as fallback PID controllers or safety filters that override the learned policy when unsafe conditions are detected. Incorporating such mechanisms can significantly improve robustness and ensure safe operation under unforeseen disturbances or model mismatch.

Overall, while the proposed approach focuses on adaptive performance improvement, the integration of safety constraints and supervisory control strategies is essential for reliable real-world applications.

To evaluate the generalization capability of the proposed controller, the policy trained under combined uncertainty conditions (including both disturbance and mass variation) is directly tested on individual scenarios without retraining.

Specifically, the trained model is applied to (i) disturbance-only and (ii) parametric uncertainty-only cases. The results show that the learned policy maintains stable tracking performance in both scenarios, demonstrating its ability to generalize across different operating conditions.

This indicates that the proposed TD3-based adaptive PID controller does not require retraining for each specific uncertainty case and can effectively adapt to varying system conditions using a single trained policy.

8. Conclusions

This paper presented a reinforcement-learning-based adaptive PID control framework for position tracking of a mass–spring–damper system operating under external disturbances, parametric uncertainty, and their simultaneous presence. The control problem was formulated as a continuous-action Markov decision process in which the state vector was constructed from the proportional, integral, and derivative components of the tracking error, and the action space corresponded to the PID gain parameters. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was employed to learn an adaptive policy capable of updating the controller gains online.

The simulation results demonstrated that the proposed approach is able to learn stable control policies under different uncertainty conditions. When external disturbances were applied, the learned controller maintained accurate tracking and quickly rejected perturbations. Under parametric uncertainty, the policy successfully adapted the PID gains to different system dynamics without manual tuning. In the most challenging scenario, where both disturbances and parameter variations were present, the controller still achieved stable and accurate tracking for both constant and time-varying reference signals. These results confirm that reinforcement-learning-assisted PID tuning can provide a robust alternative to fixed-gain controllers, while preserving the simplicity of the classical PID structure.

An important observation is that the learned controller automatically adjusts the gain values according to the operating conditions. In scenarios with stronger disturbances, the policy tends to generate smaller and more damped gains, while in nominal conditions larger gains are used to achieve faster response. This behavior indicates that the TD3 algorithm successfully learns a control strategy that balances responsiveness and stability through the reward-driven optimization process.

Despite the promising results, the present study has some limitations. First, the evaluation is performed only in a simulation environment, and real-time implementation issues such as sensor noise, actuator saturation, and computation delay are not considered. Second, the experiments are conducted on a relatively simple mass–spring–damper system, and more complex nonlinear or multi-degree-of-freedom systems may require additional investigation. Third, the reward function and training parameters were selected empirically, and different choices may affect the convergence speed and final performance.

Future work will focus on extending the proposed framework to more complex control problems, including nonlinear systems, multi-input multi-output (MIMO) systems, and real-time embedded implementations. Another important direction is the integration of safety constraints and stability guarantees into the reinforcement learning process, in order to improve reliability for practical applications. In addition, the use of alternative deep reinforcement learning algorithms and hybrid model-based learning approaches will be investigated to further enhance learning efficiency and robustness.

Overall, the results show that TD3-based adaptive PID tuning is a promising approach for intelligent control of uncertain dynamic systems and provides a flexible framework for combining classical control theory with modern reinforcement learning methods.

Looking ahead, several research directions emerge from this study. Future work includes extending the proposed framework to more complex and realistic systems, such as nonlinear and multi-degree-of-freedom dynamics, as well as real-time hardware implementations. In addition, incorporating explicit stability guarantees and safety-aware learning mechanisms, including constraint-based and safe reinforcement learning approaches, represents an important step toward practical deployment. Another promising direction is the integration of model-based and data-driven techniques to improve sample efficiency and convergence speed. Finally, exploring alternative deep reinforcement learning algorithms and adaptive reward design strategies may further enhance robustness and generalization performance in uncertain environments.

Author Contributions

Conceptualization, A.H.A.H. and U.D.; methodology, U.D. and H.B.; software, U.D.; validation, U.D. and H.B.; formal analysis, U.D. and H.B.; investigation, U.D., H.B., and B.A.; resources, A.H.A.H. and B.A.; data curation, U.D.; writing—original draft preparation, U.D.; writing—review and editing, A.H.A.H., H.B., and B.A.; visualization, U.D. and H.B.; supervision, A.H.A.H.; project administration, A.H.A.H.; funding acquisition, A.H.A.H. and B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia, under Grant KFU262654.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. (The data used to train the reinforcement learning model were generated online during training and were not stored, as retaining them offers no practical benefit).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Conceptual overview of the proposed TD3-based adaptive control framework, integrating PID-informed state representations for enhanced tracking performance and robustness.

Figure 2. Internal architecture of the TD3-PID controller. The diagram illustrates the interaction between the PID-informed actor, twin-critic evaluation, and the robust update mechanisms of the TD3 algorithm.

Figure 3. Training progress of the TD3 agent under external disturbance conditions. Episode reward, average reward, and predicted Q-value are shown across training episodes.

Figure 4. Constant reference tracking under 10% external disturbance using the trained TD3-based adaptive PID controller.

Figure 5. Sinusoidal reference tracking performance under external disturbance using the learned PID gains.

Figure 6. Training progress of the TD3 agent under parametric uncertainty conditions.

Figure 7. Constant reference tracking under $\pm 10\%$ mass variation using the adaptive PID controller.

Figure 8. Sinusoidal reference tracking under parametric uncertainty.

Figure 9. Training progress of the TD3 agent under combined internal and external disturbances.

Figure 10. Constant reference tracking under both disturbance and uncertainty.

Figure 11. Sinusoidal reference tracking under simultaneous disturbance and uncertainty.

Table 1. Notation used in the paper.

Symbol	Description
$m$	Mass of the system
$c$	Damping coefficient
$k$	Spring constant
$x (t)$	Position of the mass
$\dot{x} (t)$	Velocity of the mass
$\ddot{x} (t)$	Acceleration of the mass
$r (t)$	Reference signal
$e (t)$	Tracking error, $e (t) = r (t) - x (t)$
$u (t)$	Control input force
$d (t)$	External disturbance force
$K_{p}$	Proportional gain of PID controller
$K_{i}$	Integral gain of PID controller
$K_{d}$	Derivative gain of PID controller
$s_{t}$	State vector at time step t
$a_{t}$	Action (Control input)
$π_{θ}$	Policy function parameterized by $θ$
$r_{t}$	Reward at time step t
$γ$	Discount factor in reinforcement learning
$Q (s, a)$	Action-value function
$θ$	Actor network parameters
$ϕ$	Critic network parameters
$α$	Weight of error term in reward
$β$	Weight of control effort in reward
MDP	Markov Decision Process
TD3	Twin Delayed Deep Deterministic Policy Gradient
DRL	Deep Reinforcement Learning
PID	Proportional–Integral–Derivative controller
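
To make the notation in Table 1 concrete, the following is a minimal Python sketch of how the symbols fit together in one simulation loop. It is not the authors' implementation. The plant equation $m\ddot{x} + c\dot{x} + k x = u + d$ is the standard mass–spring–damper form; the numerical values of $m$, $c$, $k$, the time step, the semi-implicit Euler integrator, the quadratic reward $-(α e^2 + β u^2)$, and the names `MassSpringDamper`, `PIDObservation`, `reward`, `rollout`, and `fixed_gain_policy` are all assumptions made for illustration. A fixed-gain PID law stands in for the trained TD3 actor $π_{θ}$.

```python
from dataclasses import dataclass


@dataclass
class MassSpringDamper:
    """Plant m*x'' + c*x' + k*x = u + d (all numeric defaults are illustrative)."""
    m: float = 1.0    # mass
    c: float = 0.5    # damping coefficient
    k: float = 2.0    # spring constant
    dt: float = 0.01  # integration step in seconds
    x: float = 0.0    # position
    v: float = 0.0    # velocity

    def step(self, u: float, d: float = 0.0) -> float:
        # Semi-implicit Euler: update velocity first, then position.
        a = (u + d - self.c * self.v - self.k * self.x) / self.m
        self.v += a * self.dt
        self.x += self.v * self.dt
        return self.x


class PIDObservation:
    """Builds s_t = (e, integral of e, derivative of e) from the tracking error."""

    def __init__(self, dt: float):
        self.dt = dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, r: float, x: float):
        e = r - x  # tracking error e(t) = r(t) - x(t)
        self.integral += e * self.dt
        # No derivative estimate is available on the very first sample.
        de = 0.0 if self.prev_error is None else (e - self.prev_error) / self.dt
        self.prev_error = e
        return (e, self.integral, de)


def reward(e: float, u: float, alpha: float = 1.0, beta: float = 0.01) -> float:
    # Assumed quadratic form: alpha weights the error, beta the control effort.
    return -(alpha * e ** 2 + beta * u ** 2)


def fixed_gain_policy(s, kp=20.0, ki=10.0, kd=5.0, u_max=50.0):
    # Stand-in for the learned actor: a bounded PID law on the observation.
    e, ie, de = s
    u = kp * e + ki * ie + kd * de
    return max(-u_max, min(u_max, u))


def rollout(policy, r: float = 1.0, steps: int = 2000, d: float = 0.0):
    """Run one episode and return the final position and the summed reward."""
    plant = MassSpringDamper()
    obs = PIDObservation(plant.dt)
    s = obs.update(r, plant.x)
    total = 0.0
    for _ in range(steps):
        u = policy(s)          # a_t = pi(s_t)
        x = plant.step(u, d)   # apply control and disturbance
        s = obs.update(r, x)   # next observation s_{t+1}
        total += reward(s[0], u)
    return plant.x, total


if __name__ == "__main__":
    x_final, ret = rollout(fixed_gain_policy)
    print(f"final position: {x_final:.4f}, return: {ret:.2f}")
```

Replacing `fixed_gain_policy` with a trained actor that maps the same three-element observation to a bounded action would give the PID-like structure described in the paper, under the assumptions stated above.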
