Article

Soft Fuzzy Reinforcement Neural Network Proportional–Derivative Controller

1 Department of Electrical, Electronic & Computer Engineering, The University of Western Australia, Perth 6009, Australia
2 Department of Computer Science and Software Engineering, The University of Western Australia, Perth 6009, Australia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5071; https://doi.org/10.3390/app15095071
Submission received: 21 March 2025 / Revised: 16 April 2025 / Accepted: 17 April 2025 / Published: 2 May 2025

Abstract: Controlling systems with highly nonlinear or uncertain dynamics presents significant challenges, particularly when using conventional Proportional–Integral–Derivative (PID) controllers, as they can be difficult to tune. While PID controllers can be adapted for such systems using advanced tuning methods, they often struggle with lag and instability due to their integral action. In contrast, fuzzy Proportional–Derivative (PD) controllers offer a more responsive alternative by eliminating reliance on error accumulation and enabling rule-based adaptability. However, their industrial adoption remains limited due to challenges in manual rule design. To overcome this limitation, Fuzzy Neural Networks (FNNs) integrate neural networks with fuzzy logic, enabling self-learning and reducing reliance on manually crafted rules. However, most fuzzy neural network PD (FNNPD) controllers rely on mean square error (MSE)-based training, which can be inefficient and unstable in complex, dynamic systems. To address these challenges, this paper presents a Soft Fuzzy Reinforcement Neural Network PD (SFPD) controller, integrating the Soft Actor–Critic (SAC) framework into FNNPD control to improve training speed and stability. While the actor–critic framework is widely used in reinforcement learning, its application to FNNPD controllers has remained unexplored. The proposed controller leverages reinforcement learning to autonomously adjust parameters, eliminating the need for manual tuning. Additionally, entropy-regularized stochastic exploration enhances learning efficiency. The controller can operate with or without expert knowledge, leveraging neural network-driven adaptation. While expert input is not required, its inclusion accelerates convergence and improves initial performance. Experimental results show that the proposed SFPD controller achieves fast learning, superior control performance, and strong robustness to noise, making it effective for complex control tasks.

1. Introduction

Proportional–Integral–Derivative (PID) controllers [1] are widely used in the field of control systems. However, their control performance is suboptimal for strongly nonlinear, high-order, and hysteresis systems [2]. Especially in the presence of noise, traditional control systems can face challenges in maintaining optimal performance. To address these issues, various advanced technologies have emerged. Among these, fuzzy PID controllers that integrate fuzzy logic have gained widespread application (e.g., robotic trajectory tracking [3], brushless DC motors [4]) since their emergence [5,6]. These model-free controllers provide improved handling of nonlinearities, enhanced robustness, and greater flexibility compared to traditional PID controllers [7]. However, fuzzy PID controllers rely on fixed parameters derived from human expertise, making them less adaptable to system variations and highly dependent on prior knowledge [8].
While the integral term in PID controllers helps reduce steady-state errors, it also introduces lag and increases computational complexity. Unlike PID controllers, which accumulate past errors through integral action, Proportional–Derivative (PD) controllers [9] focus solely on the current error and its rate of change, allowing for quicker adaptation to dynamic system variations. However, their fixed parameters still limit adaptability in highly nonlinear or uncertain environments. Fuzzy PD controllers present an alternative to traditional PD controllers, providing greater flexibility while preserving the benefits of PD control. Their simpler fuzzy rule design [10] allows for easier implementation, while their rule-based adaptability enables them to deliver performance comparable to fuzzy PID controllers [11]. By leveraging fuzzy logic, they improve responsiveness and adaptability, making them well suited for applications that require fast reaction times without the drawbacks of integral action [12,13].
However, they still rely on rule-based methods, which can become inadequate in highly complex or uncertain environments. To enhance performance in such challenging scenarios, reinforcement learning (RL) algorithms, such as Soft Actor–Critic (SAC) [14], offer a model-free approach that learns optimal control policies directly through interaction with the environment. Unlike traditional model-based control methods, RL algorithms like SAC do not require an explicit system model, making them more adaptable to complex and dynamic environments. SAC, in particular, employs entropy maximization to promote exploration and prevent premature convergence, effectively balancing exploration and exploitation, which makes it a powerful tool for continuous control tasks. It also aims to maximize the cumulative reward (CR) while maintaining this balance. This differentiates it from classical control approaches, which typically rely on predefined control laws and system models. By leveraging an actor–critic framework and experience replay, SAC enhances sample efficiency and stability, making it well suited for complex, high-dimensional control problems.
Several studies have explored the integration of Soft Actor–Critic (SAC) with PID control, leveraging the replay buffer to enhance sample efficiency and improve system performance. He et al. [15] applied SAC-based deep reinforcement learning for online PID tuning in hydraulic servo control systems, showcasing its effectiveness in real-time applications by improving the system’s responsiveness and stability. Yu et al. [16] proposed a self-adaptive SAC-PID control approach for mobile robots, emphasizing its capability to dynamically adjust to unpredictable environments. Song et al. [17] investigated SAC-based PID parameter tuning for unmanned surface vehicle (USV) path following, optimizing control performance through automated and real-time tuning. While these approaches take advantage of SAC’s model-free adaptability, they must also contend with experience replay, a core mechanism of reinforcement learning (RL) that stores and reuses past transitions to improve sample efficiency [18]. More specifically, in PID controllers, the integral term introduces a dependency on historical errors, which conflicts with the Markov property and can lead to unstable learning dynamics when combined with experience replay. While training with PID controllers and experience replay is not inherently incapable of convergence, it tends to exhibit instability, with successful convergence occurring inconsistently across multiple training attempts [19]. In contrast, fuzzy PD controllers do not suffer from this issue, as they rely solely on the current error and its derivative.
Existing studies on auto-tuning fuzzy PD controllers using neural networks and reinforcement learning present several limitations. Some approaches utilize Q-learning to adjust fuzzy PD parameters [20], but without a replay buffer, the learning process is inefficient and less effective compared to SAC. Additionally, certain methods employ neural networks as a compensation mechanism for PD controllers [21,22], requiring a predefined PD controller while still lacking a replay buffer, which limits sample efficiency. Other studies explored the use of neural networks to tune fuzzy rule parameters through supervised learning, necessitating pre-collected training data [23]. However, obtaining high-quality training data can be challenging, as it often requires extensive expert knowledge, precise system modeling, or costly real-world experiments. Additionally, the collected data may not fully represent the system’s operating conditions, leading to poor generalization when applied to new scenarios. Reddy K. H. [24] combined neural networks with fuzzy PD controllers to control the speed of BLDC motors, using the mean square error (MSE) for training without a critic network or experience replay, leading to suboptimal learning. Similarly, McCutcheon L. [25] applied model-based reinforcement learning with a State-Buffer State Prediction (SBSP) framework to auto-tune PD parameters for systems with large time delays. However, their approach employed a fully connected neural network within RL to tune the parameters of a fuzzy PD controller, which is less effective than using a fuzzy neural network (FNN) [26], which is inherently better at capturing nonlinear relationships and adapting to system variations. While fuzzy logic itself can incorporate expert knowledge, relying solely on a standard neural network for parameter tuning limits its effectiveness, leading to poor initial performance and slow convergence. Moreover, FNNs offer a degree of interpretability by explicitly representing fuzzy rules and membership functions, making it easier to understand and analyze the decision-making process compared to standard neural networks [27]. These limitations highlight the need for a more effective approach that integrates fuzzy logic, neural networks, and reinforcement learning within a structured framework, ensuring better adaptability, interpretability, and sample efficiency for fuzzy PD controller optimization.
To the best of our knowledge, no studies have leveraged the SAC framework for fuzzy neural network PD controllers. This paper explores this novel application, marking the first known attempt to integrate SAC with fuzzy neural network PD controllers. We propose employing a fuzzy neural network to automatically adjust the PD controller’s parameters. Through reinforcement learning, the fuzzy neural network’s parameters are learned and adapted autonomously. By leveraging an actor–critic architecture, the parameter adjustments of the fuzzy neural network are guided by the critic network. The backpropagation algorithm is utilized to fine-tune the fuzzy neural network parameters, enabling the PD controller to achieve the desired performance. To enhance exploration efficiency, a stochastic output is employed during training.
In summary, this paper makes the following contributions:
  • A model-free SAC framework for fuzzy neural network PD controllers that uses reinforcement learning to automatically adjust fuzzy rules for optimal control.
  • Incorporation of expert knowledge: the partial interpretability of FNNs allows expert knowledge to be included during initialization, enabling faster convergence early in training.
  • Enhanced training efficiency: the controller uses stochastic optimization and integrates entropy during training, improving exploration efficiency and significantly reducing training time.
  • Superior performance: experimental results show that the SFPD controller achieves faster convergence and superior control performance compared to both SAC and traditional PD controllers.

2. Preliminaries

2.1. Reinforcement Learning (RL)

RL is a model-free approach where the controller continuously interacts with the environment to receive feedback and iteratively learns control strategies [28]. The goal of RL is to maximize the cumulative reward (CR). Based on the current state, the RL agent selects an action, which the environment then responds to by providing a reward and transitioning to the next state. Through this cyclic training process, the agent learns to determine the optimal actions that yield the highest CR given the current state.
As shown in Figure 1, the agent and the environment together constitute the RL framework. The agent is determined by the policy, $\pi$, which is a mapping from the state space $S$ to the action space $A$ ($\pi: S \to A$). At each time step, the policy selects an action ($a_t$) based on the current state ($s_t$), which is then applied to the environment to obtain a reward ($r_t$). The goal of RL is to find an optimal policy, denoted as $\pi^*$, that maximizes the cumulative reward (CR) over an epoch.
However, relying solely on the current reward r to train the policy makes it challenging to achieve the maximum CR because we need to consider long-term rewards, not just the immediate reward ( r t ). This is where the critic network comes into play, evaluating the quality of the current state–action pair. This forms the basis of the actor–critic algorithm. Among these, the Soft Actor–Critic (SAC) [29] algorithm stands out due to its efficient exploration–exploitation trade-off, off-policy learning, and superior stability in continuous action spaces.
The SAC algorithm uses an experience replay buffer to store tuples of (state, action, reward, next state). It resamples batches of these tuples from the buffer and trains the neural network using stochastic gradient descent (SGD) [30]. Additionally, the SAC algorithm incorporates entropy into the exploration process to balance exploration and exploitation, which introduces stochasticity into the decision-making process. Consequently, the cost function of the policy in SAC is:
$$J(\pi) = \mathbb{E}\left[ R(\tau) + \alpha H(\pi) \right]$$
where
  • J ( π ) : the objective function of the policy network.
  • E [ R ( τ ) ] : the expected return of a trajectory.
  • α : a hyperparameter used to balance the weights of return and entropy.
  • H ( π ) : the entropy of the policy network distribution.
SAC is an actor–critic architecture with a critic that consists of a state–action value (Q-value) network and a state value (V-value) network. The objective functions for these networks involve maximizing the CR and entropy, leading to what are known as the soft Q-value function and the soft V-value function.
$$Q_\theta^\pi(s,a) = \mathbb{E}_{s_t, a_t \sim \rho_\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) + \alpha \sum_{t=1}^{\infty} \gamma^{t} H\big(\pi(\cdot \mid s_t)\big) \,\middle|\, s_0 = s,\ a_0 = a\right]$$
$$V_\psi(s_t) = \mathbb{E}_{a_t \sim \rho_\pi}\!\left[Q_\theta^\pi(s_t, a_t) - \alpha \log \pi_\varphi(a_t \mid s_t)\right]$$
where γ is a hyperparameter known as the discount factor, and α is the temperature factor. Based on the Bellman equation and (1)–(3), the loss functions for the policy, Q-value, and V-value can be derived as follows:
$$J_\pi(\varphi) = \mathbb{E}_{s_t \sim D,\, \epsilon \sim \mathcal{N}}\!\left[\frac{1}{2}\Big(\alpha \log \pi_\varphi\big(f_\varphi(\epsilon_t, s_t) \mid s_t\big) - Q_\theta\big(s_t, f_\varphi(\epsilon_t, s_t)\big)\Big)^{2}\right]$$
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim D,\, a_{t+1} \sim \rho_\pi}\!\left[\frac{1}{2}\Big(Q_\theta(s_t, a_t) - \big(r + \gamma\, Q_\theta(s_{t+1}, a_{t+1})\big)\Big)^{2}\right]$$
$$J_V(\psi) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim D,\, a_{t+1} \sim \rho_\pi}\!\left[\frac{1}{2}\Big(V_\psi(s_t) - \mathbb{E}_{a_{t+1} \sim \rho_\pi}\big[Q_\theta^\pi(s_{t+1}, a_{t+1}) - \alpha \log \pi_\varphi(a_t \mid s_t)\big]\Big)^{2}\right]$$
In (4), f φ ( ϵ t , s t ) employs the reparameterization trick [31] as shown in (7). This technique ensures the differentiability of the policy function while expanding the exploration range.
$$f_\varphi(\epsilon_t, s_t) = f_\varphi^{\mu}(s_t) + \epsilon \cdot f_\varphi^{\sigma}(s_t)$$
where
  • f φ μ ( s t ) : the output of the policy network, which serves as the mean of a Gaussian distribution.
  • f φ σ ( s t ) : the standard deviation of the Gaussian distribution.
  • ϵ : a sample from the standard normal Gaussian distribution.
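As an illustration of (7), the following minimal sketch (assuming NumPy; the function and variable names are illustrative and not taken from the authors' implementation) shows how a differentiable stochastic sample is drawn from a Gaussian whose mean and standard deviation are produced by a network:

```python
import numpy as np

def reparameterized_sample(mu, log_sigma, rng=None):
    """Draw mu + eps * sigma with eps ~ N(0, 1).

    Sampling eps independently of the network parameters lets gradients
    flow through mu and sigma (the reparameterization trick).
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(size=np.shape(mu))  # epsilon ~ N(0, I)
    sigma = np.exp(log_sigma)                     # keep the standard deviation positive
    return mu + eps * sigma

# Example: mean and log-std produced by a policy network for one state
mu = np.array([0.3, -0.1])          # stands in for f_phi_mu(s_t)
log_sigma = np.array([-1.0, -0.5])  # log of f_phi_sigma(s_t)
action = reparameterized_sample(mu, log_sigma)
```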

2.2. Fuzzy Neural Network PD (FNNPD) Controller

A fuzzy logic system (FLS) [32] is a control system that utilizes expert knowledge and fuzzy logic. The fuzzy control system consists of a fuzzification interface, an inference engine (including a database and a rule base), and a defuzzification interface, as shown in Figure 2. With fuzzy rules designed based on expert experience, an FLS can effectively identify and control systems [33]. However, the quality of control heavily relies on the expert-designed fuzzy rules.
A fuzzy neural network (FNN) [34] combines the learning capabilities of neural networks with the advantages of an FLS. Its structure is depicted in Figure 3, where the FNN outputs the coefficients for a PD controller, using the system error $e$ and the error derivative $\dot{e}$ as inputs. With seven membership functions per input variable, the two input variables generate a total of 49 (7 × 7) fuzzy rules.
As shown in Figure 3, an FNN comprises five layers: the input layer, membership layer, rule layer, normalization layer, and output layer [35]. The membership layer transforms the input signal into fuzzy input sets according to the fuzzy membership functions. In the rule layer, each neuron represents a fuzzy rule, which transforms the fuzzy input sets into fuzzy output sets. The normalization layer performs the normalization operation. Finally, the output layer transforms the fuzzy output set into a crisp output.
The traditional PD controller [36] is a classical controller widely used in industry due to its simple structure and effective control performance. The formula for the PD controller is as follows:
$$u = K_p \times e + K_d \times \dot{e}$$
where $K_p$ and $K_d$ are the PD parameters, $e$ represents the system error, and $\dot{e}$ represents the rate of error change.
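A minimal discrete-time sketch of the PD law in (8) is shown below; the sampling period, the finite-difference approximation of $\dot{e}$, and the gain values are illustrative assumptions, not values from the paper.

```python
class PDController:
    """Discrete-time PD controller: u = Kp*e + Kd*de/dt (finite-difference derivative)."""

    def __init__(self, kp, kd, dt):
        self.kp, self.kd, self.dt = kp, kd, dt
        self.prev_error = 0.0

    def step(self, error):
        d_error = (error - self.prev_error) / self.dt  # approximate e_dot
        self.prev_error = error
        return self.kp * error + self.kd * d_error

controller = PDController(kp=5.0, kd=1.0, dt=0.05)  # placeholder gains and sample time
u = controller.step(error=0.2)
```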
However, its effectiveness diminishes in nonlinear systems, and adjusting PD parameters can be challenging. Combining fuzzy neural networks with the PD controller forms the fuzzy neural network PD controller (FNNPD) [20], which exhibits significant advantages. By utilizing a fuzzy neural network, the PD parameters can be automatically adjusted based on the system’s current state. It not only has learning capabilities and utilizes expert knowledge but also retains the robustness of the PD controller.

3. Proposed Soft Fuzzy Reinforcement Neural Network PD (SFPD) Controller

The proposed SFPD method in this paper is built on the SAC framework, with the policy network employing an FNNPD controller. Leveraging SAC enhances learning speed and stability. While FNNs enable the integration of expert knowledge through fuzzy rules, the proposed method starts with an imperfect fuzzy rule as a foundation and refines it through learning, demonstrating the controller’s ability to effectively enhance expert knowledge. The system diagram is shown in Figure 4.

3.1. Forward Propagation

The lower part of Figure 4 depicts the interaction between SFPD and the environment, representing the forward propagation. Based on the current state $S_t$, the FNNs provide the PD control parameters. Each PD parameter, along with the standard deviation provided by the standard neural networks (Std NNs), is then transformed into the final PD controller parameters using the reparameterization technique. The PD controller subsequently calculates the current action $a$, which is applied to the environment (plant). The plant provides a reward, and the process continues to the next state $S_{t+1}$. The loop continues until the current epoch ends. The tuple ($S_t$, $a$, $r$, $S_{t+1}$) is stored in the replay buffer for parameter updates. Inputs are processed through tanh before being passed to the neural network, which aids convergence by normalizing values and maintaining a zero-centered distribution.
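The interaction loop and replay buffer described above can be sketched as follows; `env`, `fnn_pd_policy`, the buffer size, and the Gymnasium-style step API are assumptions made for illustration.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)          # stores (s_t, a_t, r_t, s_t+1) tuples

def run_episode(env, fnn_pd_policy, max_steps=200):
    s, _ = env.reset()
    for _ in range(max_steps):
        a = fnn_pd_policy(s)                   # PD action from FNN-derived (P, D) plus reparameterized noise
        s_next, r, terminated, truncated, _ = env.step(a)
        replay_buffer.append((s, a, r, s_next))
        s = s_next
        if terminated or truncated:
            break

def sample_batch(batch_size=64):
    """Uniformly resample stored transitions for the update phase."""
    return random.sample(replay_buffer, batch_size)
```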
In FNNs, the first layer is the input layer, which receives the inputs x. The tanh transformation is applied before training, as a preprocessing step to normalize the inputs into the range [ 1 , 1 ] , which facilitates stable and efficient convergence. While tanh is not part of the model architecture and does not affect gradient flow during training, this bounded and smooth transformation helps reduce the impact of outliers and brings the input data into a scale that aligns well with the model’s initialization and training dynamics, thereby improving stability and robustness.
$$x = \tanh\big([e, \dot{e}]\big)$$
The second layer is the fuzzification layer and consists of fuzzy memberships. Each neuron represents one membership. The Gaussian function is used to transform the input into the fuzzy sets, as shown below [37]:
$$\mu_{ij} = e^{-\frac{(x_i - c_{ij})^2}{\sigma_{ij}^2}}$$
where x i is the input, and c i j , σ i j represent the center and width of the distribution. The notation i = 2 , j = 7 indicates that there are two crisp inputs, each associated with seven fuzzy memberships. The partial interpretability of Fuzzy Neural Networks (FNNs) enables the integration of expert knowledge during the initialization phase. By setting initial parameters such as the centers and widths of the Gaussian membership functions based on domain expertise, the model can achieve a more informed starting point. This approach leads to faster convergence early in the training process, as the network starts with parameters that are more aligned with the underlying system behavior.
The third layer is the multiplication of all the input fuzzy sets, which represents the fuzzy inference:
$$\alpha_j = \mu_{1 i_1} \cdot \mu_{2 i_2} \cdots \mu_{n i_n}$$
where $n = 2$ is the number of fuzzy membership functions (one per input) associated with this rule.
The fourth layer is the normalization layer.
$$\beta_j = \frac{\alpha_j}{\sum_{k=1}^{m} \alpha_k}, \quad j = 1, 2, \ldots, m$$
where $m = j^i = 49$ indicates that there are 49 fuzzy rules.
The fifth layer is the defuzzification layer:
$$y_i = \sum_{j=1}^{m} \omega_{ij}\, \beta_j, \quad i = 1, 2, \ldots, r$$
where $y_i$ comprises the two nominal PD parameters, $P'$ and $D'$, and $\omega_{ij}$ denotes the connection weight. Subsequently, $P'$ and $D'$ are reparameterized with the outputs from the standard neural network to obtain the final parameters $P$ and $D$ of the PD controller:
$$P = P' + \epsilon \cdot f_\phi^{P}(s_t)$$
$$D = D' + \epsilon \cdot f_\phi^{D}(s_t)$$
where ϕ denotes the parameters of the standard neural networks.
Then, the PD controller computes the action output using (8).
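A NumPy sketch of the forward pass through the five FNN layers, (9)–(13), followed by the reparameterization in (14) and (15), is given below. The array shapes follow the paper's configuration (two inputs, seven memberships, 49 rules, two outputs), but the zero weights, the centre/width initialization, and the Std NN outputs are placeholders rather than trained values.

```python
import numpy as np

n_inputs, n_mf, n_rules, n_out = 2, 7, 7 ** 2, 2   # inputs (e, de), memberships, rules, outputs (P, D)

# Trainable FNN parameters (placeholder initialization)
c = np.tile(np.linspace(-1.0, 1.0, n_mf), (n_inputs, 1))   # Gaussian centers, shape (2, 7)
sigma = np.full((n_inputs, n_mf), 0.5)                      # Gaussian widths,  shape (2, 7)
w = np.zeros((n_out, n_rules))                              # defuzzification weights, shape (2, 49)

def fnn_forward(e, de):
    x = np.tanh([e, de])                                # layer 1: inputs squashed to [-1, 1], Eq. (9)
    mu = np.exp(-((x[:, None] - c) ** 2) / sigma ** 2)  # layer 2: Gaussian memberships, Eq. (10)
    alpha = np.outer(mu[0], mu[1]).ravel()              # layer 3: 49 rule firing strengths, Eq. (11)
    beta = alpha / alpha.sum()                          # layer 4: normalization, Eq. (12)
    return w @ beta                                     # layer 5: defuzzified [P', D'], Eq. (13)

# Reparameterize with standard deviations from the Std NNs (placeholders here)
P_prime, D_prime = fnn_forward(e=0.1, de=-0.02)
std_P, std_D = 0.1, 0.1                                 # stand-ins for f_phi^P(s_t), f_phi^D(s_t)
eps = np.random.standard_normal(2)
P = P_prime + eps[0] * std_P
D = D_prime + eps[1] * std_D
u = P * 0.1 + D * (-0.02)                               # PD action from Eq. (8), with e = 0.1, e_dot = -0.02
```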

3.2. Backward Propagation

The upper part of Figure 4 represents the update process, specifically the backpropagation phase. Backpropagation is a key algorithm in training neural networks, where the model’s error is propagated backward through the network. During this phase, the gradients of the loss function with respect to each weight are calculated, allowing the model to update its weights to minimize the error. This iterative process helps the network learn from the data and improve its performance over time. Initially, a batch of tuples $(s_t, a_t, r, s_{t+1})$ is retrieved from the replay buffer to update the various network parameters. For enhanced stability during training, this study employed two state–action value neural networks (QNNs), using the minimum of their outputs, $\min(Q_1, Q_2)$, and incorporated a target state value network (VNNs). The QNNs were updated using the Bellman equation as shown in (5). The loss function for the VNNs is depicted in (6), where the calculation of $Q_\theta^\pi(s_{t+1}, a_{t+1})$ selects the minimum output of the two QNNs, while $\log \pi_\phi(a_t \mid s_t)$ represents the sum of the log-probabilities of the parameters $P$ and $D$ output by the Std NNs, as shown below:
$$\log \pi_\phi(a_t \mid s_t) = \log \pi_\phi\big(f_\phi^{P}(s_t)\big) + \log \pi_\phi\big(f_\phi^{D}(s_t)\big)$$
The loss function for FNNs and the Std NNs, as shown in (4), similarly involves selecting the minimum Q value and incorporating the aforementioned logarithm.
The forward and backward propagation are summarized in Algorithm 1.
Algorithm 1 SFPD
1:  Initialize all the NNs, including $Q_1$, $Q_2$, $V$, the FNNs, and the Std NNs.
2:  Initialize the target VNNs, with $\theta_{v'} \leftarrow \theta_v$.
3:  Initialize the parameters $\omega_{ij}$, $c_{ij}$, and $\sigma_{ij}$ in the FNNs.
4:  Set up a replay buffer $R$.
5:  for episodes $k = 1$ to $E$
6:      while this episode is not completed and the number of steps does not exceed $N$
7:          Obtain an action $a(t)$ from the PD controller, with parameters derived from the FNNs and Std NNs based on the current state $s(t)$, according to the forward propagation
8:          Apply action $a(t)$ to interact with the environment, obtaining the next state $s(t+1)$ and reward $r(t)$
9:          Put the tuple $(s(t), a(t), s(t+1), r(t))$ into the replay buffer
            • Updating part:
10:         Sample $N$ samples from the replay buffer $R$
11:         for all the $N$ samples
12:             Update the two QNNs, $Q_1$ and $Q_2$, based on the Bellman equation according to (5)
13:             Based on the sampled states $s(t)$, calculate the action $a(t)$ from the FNNs and the log probability from the Std NNs
14:             Based on the states $s(t)$ and actions $a(t)$, calculate $Q_1$, $Q_2$, and $\min(Q_1, Q_2)$
15:             Update the VNNs based on the loss function described in (6)
16:             Update all the parameters $\omega_{ij}$, $c_{ij}$, and $\sigma_{ij}$ in the FNNs according to the loss function in (4)
17:             Update all the parameters $\varphi$ in the Std NNs according to the loss function in (4)
18:             Update the target VNNs, $\theta_{v'} \leftarrow \theta_v$
19:         end for
20:     end while
21: end for
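The loss computations in steps 12–17 of Algorithm 1 can be sketched with PyTorch as follows. Here `q1_net`, `q2_net`, `v_net`, `v_target`, and `policy.sample` are assumed interfaces, and the snippet is a schematic of the SAC-style update rather than the authors' exact implementation (in particular, the policy loss below uses the standard SAC form, whereas Eq. (4) in the paper uses a squared variant).

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, policy, q1_net, q2_net, v_net, v_target, alpha=0.2, gamma=0.99):
    s, a, r, s_next = batch                       # tensors sampled from the replay buffer

    # Q-network targets (Eq. 5): r + gamma * V_target(s')
    with torch.no_grad():
        q_target = r + gamma * v_target(s_next).squeeze(-1)
    q1_loss = F.mse_loss(q1_net(s, a).squeeze(-1), q_target)
    q2_loss = F.mse_loss(q2_net(s, a).squeeze(-1), q_target)

    # Fresh actions and log-probs from the current policy (reparameterized);
    # log_prob sums the P and D terms as in Eq. (16)
    a_new, log_prob = policy.sample(s)
    q_min = torch.min(q1_net(s, a_new), q2_net(s, a_new)).squeeze(-1)

    # V-network loss (Eq. 6): regress V(s) toward min Q - alpha * log pi
    v_loss = F.mse_loss(v_net(s).squeeze(-1), (q_min - alpha * log_prob).detach())

    # Policy loss: push alpha * log pi toward the soft Q-value (standard SAC form)
    policy_loss = (alpha * log_prob - q_min).mean()
    return q1_loss, q2_loss, v_loss, policy_loss
```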

4. Results and Discussion

4.1. Experimental Setup

Our algorithm was tested on two systems: the pendulum and the Continuous Stirred Tank Reactor (CSTR). The pendulum was chosen because it is a classic nonlinear system with unstable dynamics, making it a standard benchmark for control algorithms. The CSTR was selected due to its large control input space, requiring extensive exploration, and its strong nonlinear characteristics, which pose significant control challenges. These two systems ensured our algorithm was tested in both mechanical and process control domains. The SFPD controller used in the simulations was implemented in Python 3.9. The pendulum system leveraged the gym-pendulum-v1 environment from Python’s Gym library, while the CSTR model was built using MATLAB/Simulink R2023a. The simulations were performed on a personal computer with an Intel(R) Core(TM) i7-12700H processor (2.30 GHz), 64 GB of RAM, and a 64-bit operating system. The proposed controller exhibited low memory usage (peak < 0.4 MB ), demonstrating its memory efficiency. While edge deployment was not the focus of this study, previous works (e.g., Prado et al. [38]) have demonstrated the feasibility of similar lightweight controllers on embedded hardware such as FPGAs.
The commonly used metrics IAE (Integral of Absolute Error) and ISE (Integral of Squared Error) were adopted to evaluate the performance of the control system [39]. These metrics help quantify the deviation between the desired output (setpoint) and the actual system response. The definitions of IAE and ISE are shown in (17).
$$IAE = \sum_{k=0}^{n} |e(k)|, \qquad ISE = \sum_{k=0}^{n} e^{2}(k)$$
where n denotes the total number of steps, and e ( k ) represents the system error at step k.
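A direct implementation of (17) is straightforward; the sketch below assumes the error sequence is available as a NumPy array:

```python
import numpy as np

def iae(errors):
    """Integral of Absolute Error: sum of |e(k)| over all steps."""
    return np.sum(np.abs(errors))

def ise(errors):
    """Integral of Squared Error: sum of e(k)^2 over all steps."""
    return np.sum(np.square(errors))

errors = np.array([0.5, 0.2, -0.1, 0.05])   # example error trajectory
print(iae(errors), ise(errors))
```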
The fuzzy neural network utilized seven fuzzy memberships [40] for each input, leading to seven means and standard deviations for both the error and the error rate, resulting in a total of 28 trainable parameters in the membership layer. The choice of seven memberships balanced accuracy and computational efficiency, providing sufficient granularity to capture nonlinear relationships while avoiding excessive complexity, which allowed the network to generalize well while keeping computational costs manageable. Additionally, a fully combinatorial fuzzy rule design (also known as the Rule Cartesian Product) was adopted, where each of the seven input membership functions for the two input variables was paired to form 7 × 7 = 49 fuzzy rules. These rules were directly mapped to 49 corresponding weights in the defuzzification layer, enhancing the network’s generalization ability while keeping computational costs low. To assess the robustness of the method, we initialized the centers (means) of the Gaussian membership functions uniformly within their respective value ranges, while the variances were fixed at 0.5, ensuring that the initialization was adaptable across different systems. Table 1 summarizes the hyperparameter settings.
The comparison methods in this paper included SAC and traditional PD algorithms. SAC was chosen as a baseline since the proposed method built upon it, enabling a direct comparison to highlight improvements in convergence and exploration efficiency in large action spaces. The traditional PD algorithm served as a benchmark to demonstrate the benefits of dynamic PD parameter adjustment, which enhances adaptability and robustness while eliminating the need for manual tuning [15].

4.1.1. Case 1: Pendulum-v1

The pendulum is a classic test environment where one end is fixed, and the other end can move freely. By applying appropriate torque, the goal is to balance the pendulum upright and maintain its position as shown in Figure 5.

System Model

The following equations represent the pendulum system model:
$$\ddot{\theta} = \frac{3g}{2l}\sin(\theta) + \frac{3}{ml^2}\,u$$
where θ represents the angle of the pendulum. The angular velocity and angle of the pendulum are obtained through integration. The parameters for the pendulum system are detailed in Table 2.
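For illustration, (18) can be advanced in time with a simple Euler scheme as sketched below; the time step and parameter values (g, m, l) are placeholders rather than the entries of Table 2, and Gym's Pendulum-v1 uses its own internal integrator.

```python
import numpy as np

def pendulum_step(theta, theta_dot, u, g=10.0, m=1.0, l=1.0, dt=0.05):
    """One Euler step of theta_ddot = (3g / 2l) * sin(theta) + (3 / (m l^2)) * u."""
    theta_ddot = (3.0 * g / (2.0 * l)) * np.sin(theta) + (3.0 / (m * l ** 2)) * u
    theta_dot = theta_dot + theta_ddot * dt
    theta = theta + theta_dot * dt
    return theta, theta_dot
```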

Reward Function

The reward function was designed to guide the agent toward the desired behavior. It was formulated as follows:
$$r = -\left(\theta^{2} + 0.1\,\dot{\theta}^{2} + 0.01\,u^{2}\right)$$
where $\theta$ is the angle of the pendulum relative to the vertical position, $\dot{\theta}$ is the angular velocity, and $u$ is the control input. The reward function was designed to balance stability and control efficiency. It penalized deviations from the desired upright position, high velocities, and excessive control efforts. By discouraging large angle deviations, fast movements, and unnecessary control actions, the function guided the agent to maintain the pendulum in a stable position with minimal control effort, ensuring efficient and effective performance.
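The reward in (19) can be computed directly from the state and control input; a direct implementation is:

```python
def pendulum_reward(theta, theta_dot, u):
    """Quadratic penalty on angle, angular velocity, and control effort, Eq. (19)."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.01 * u ** 2)
```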

4.1.2. Case 2: CSTR System

The Continuous Stirred Tank Reactor (CSTR) is widely used in industrial applications. Due to its high degree of nonlinearity, it imposes stringent requirements on controllers. This paper tested the effectiveness of the proposed method on a CSTR model.

System Model

In a CSTR model, the jacket temperature is the control variable, the reactor temperature is the process variable, and the feed temperature is the disturbance variable. The system equations for the CSTR are as follows [41]. The component mass balance within the vessel is
$$\frac{dC_A}{dt} = \frac{F}{V}\left(C_{Af}(t) - C_A(t)\right) - r(t)$$
where
$$r(t) = k_0\, e^{-\frac{E}{R\,T(t)}}\, C_A(t)$$
where $E$ represents the activation energy, $R$ is the ideal gas constant, $T$ denotes the temperature inside the reactor, and $k_0$ is the pre-exponential (nonthermal) constant.
The change in temperature within the vessel is
$$\frac{dT(t)}{dt} = \frac{F}{V}\left(T_f(t) - T(t)\right) - \frac{\Delta H}{\rho C_p}\, r(t) - \frac{UA}{\rho C_p V}\left(T(t) - T_c(t)\right)$$
where
  • Δ H represents the heat of the reaction per mole;
  • C p denotes the heat capacity coefficient;
  • ρ stands for the density coefficient;
  • U represents the overall heat transfer coefficient;
  • A denotes the area for heat exchange, specifically the interface area between the coolant and vessel.
The initial value of $C_{Af}(t)$ was held constant at 1 kmol/m³, $T_f(t)$ was constant at 350 K, and $T_c(0) = 50$ K, with the admissible range $50 \le T_c \le 420$ K. The values of the various parameters are shown in Table 3.
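For illustration, (20)–(22) can be advanced in time with a simple Euler scheme as sketched below; the lumped parameter values are placeholders, not the entries of Table 3.

```python
import numpy as np

def cstr_step(C_A, T, T_c, dt=0.1,
              F_over_V=1.0, C_Af=1.0, T_f=350.0,
              k0=7.2e10, E_over_R=8750.0,
              dH_over_rhoCp=-5.0, UA_over_rhoCpV=5.0):
    """One Euler step of the CSTR balances (Eqs. 20-22); parameter values are illustrative."""
    r = k0 * np.exp(-E_over_R / T) * C_A                 # reaction rate, Eq. (21)
    dC_A = F_over_V * (C_Af - C_A) - r                   # component balance, Eq. (20)
    dT = (F_over_V * (T_f - T)
          - dH_over_rhoCp * r
          - UA_over_rhoCpV * (T - T_c))                  # energy balance, Eq. (22)
    return C_A + dC_A * dt, T + dT * dt
```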

Reward Function

The reward function for the CSTR (Continuous Stirred Tank Reactor) system was designed to guide the system towards minimizing the error e while preventing large gradients during the learning process. It was defined as follows:
$$r = \begin{cases} -\beta e^{2} + 1, & \text{if } |e| < 0.01 \\ -\beta e^{2}, & \text{otherwise} \end{cases}$$
where $\beta = 0.0025$ is a small scale factor that ensures the penalty does not become too large, helping to prevent gradient explosion during training. When the error $|e|$ is small (i.e., less than 0.01), the function provides a positive reward of $1 - \beta e^2$ to encourage the system to stay close to the desired state. This positive reward helps the system to fine-tune its performance and maintain small errors. However, for larger errors, the reward becomes negative and its magnitude increases quadratically with the error. This quadratic increase discourages significant deviations from the target, promoting convergence toward the desired state and penalizing large errors. By balancing small positive rewards for minor errors and strong penalties for large errors, the function ensures the system remains stable and learns efficiently without excessive fluctuations or instability.
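A direct implementation of this piecewise reward (Eq. 23) is:

```python
def cstr_reward(error, beta=0.0025, tol=0.01):
    """Piecewise reward: +1 bonus when |e| < tol, quadratic penalty otherwise (Eq. 23)."""
    penalty = -beta * error ** 2
    return penalty + 1.0 if abs(error) < tol else penalty
```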

4.2. Experimental Results

This section presents the test results and analysis conducted on the Pendulum-v1 and CSTR models, demonstrating the effectiveness and performance of the proposed approach.

4.2.1. Case 1: Pendulum-v1

Training Performance Evaluation

Figure 6 represents the training processes of SFPD and SAC, evaluated by the cumulative reward (CR) per episode [42]. The definition of cumulative reward (CR) is as follows:
$$CR = \sum_{t=1}^{T} r(t)$$
The training process spanned 10 epochs. The policy network parameters for SAC were randomly initialized, while the fuzzy membership initialization parameters for SFPD’s FNNs were uniformly distributed within their respective value ranges.
Figure 6a shows the variation in cumulative rewards with training episodes for two algorithms (SFPD and SAC) in the pendulum control task. It suggested that the SFPD algorithm improved cumulative rewards rapidly during the early episodes, with a significantly faster convergence speed than the SAC algorithm. Additionally, the cumulative rewards of SFPD exhibited greater stability, with almost no significant regressions. In contrast, the SAC algorithm experienced significant fluctuations during the early stage and showed noticeable negative trends in the mid-training phase (around episodes 50 to 70), indicating potential instability in its policy updates. Overall, SFPD outperformed SAC, demonstrating faster convergence and higher stability.
Figure 6b describes the changes in policy gradient loss per episode for both algorithms. It can be observed that the policy loss of SFPD trended toward zero during the early training stage and remained stable throughout the process, indicating a smooth and well-converged policy optimization. In contrast, the SAC algorithm exhibited severe fluctuations in policy loss during the initial training stage, with significant negative gradients observed around episodes 50 to 70, which may reflect instability in its policy updates. While the policy loss of SAC gradually stabilized later, its overall performance remained less stable compared to SFPD.
Figure 6c presents the loss trends of the Q networks (Q1 and Q2) for both algorithms across training episodes. Results show that the Q network loss of SFPD decreased rapidly in the early training stage and stabilized thereafter, indicating effective control of value function estimation errors. On the other hand, the Q network loss of SAC showed large fluctuations during the early stages, and dramatic oscillations occurred around episodes 50 to 70. This aligned with the fluctuations observed in SAC’s cumulative rewards and policy loss, further indicating instability in value function estimation during that phase. Overall, SFPD exhibited superior stability and performance in Q network updates.
Figure 6d displays the trends in V network loss across episodes for the two algorithms. The V network loss of SFPD declined rapidly in the early training stage and remained at a low level throughout the process, suggesting a stable and efficient value function learning process. In contrast, the SAC algorithm experienced significant peaks in V network loss around episodes 50 to 70, reflecting instability in value function learning. This phenomenon was consistent with the fluctuations observed in SAC’s policy and Q network performance, further confirming its overall optimization instability.
The comparative analysis suggested that SFPD outperformed SAC in terms of cumulative rewards, policy loss, Q network loss, and V network loss. Its primary advantages lay in the rapid convergence and high stability during the training process. This suggests that SFPD is better suited than SAC for solving complex dynamic optimization problems in pendulum control tasks.

Testing Performance Evaluation

To evaluate the effectiveness of the proposed method, a series of controlled experiments were conducted, comparing its performance with Soft Actor–Critic (SAC) and traditional Proportional–Derivative (PD) controllers. The experiments were designed to assess response speed, accuracy, stability, noise robustness, and disturbance rejection capability. The initial evaluation was conducted under fixed initial conditions ($\theta = 135^{\circ}$, $\dot{\theta} = 2$ rad/s) to ensure a controlled comparison. The control parameters of the traditional PD algorithm were determined using the Ziegler–Nichols tuning method ($K_p = 5.708$, $K_d = 9.510$). As shown in Figure 7a, the proposed SFPD algorithm exhibited superior speed, accuracy, and stability compared to SAC and traditional PD.
To analyze the impact of sensor noise on control performance, a Gaussian white noise measurement disturbance $\mu \sim N(0, 0.1^2)$ was introduced. Figure 7b demonstrates that SFPD exhibited strong resilience to noise, owing to its Markov-based control strategy, which ensured decisions were made based on the current state, thereby mitigating the accumulation of errors over time. To further assess disturbance rejection capability, an external torque perturbation was applied [43] between steps 100 and 102. As shown in Figure 7c, SFPD demonstrated the fastest recovery with minimal oscillations, highlighting its exceptional robustness. SAC maintained stability but exhibited a slower response, while traditional PD showed significant overshoot and oscillations, indicating its limited disturbance rejection capability.
A detailed quantitative analysis of the control performance is presented in Table 4. In noise-free conditions, SFPD achieved a 12.15% reduction in Integral of Squared Error (ISE) and a 40.19% reduction in Integral of Absolute Error (IAE) compared to SAC. Additionally, SFPD exhibited a 59.16% improvement in ISE and a 74.20% improvement in IAE over traditional PD, highlighting its superior error minimization. Even under measurement noise interference, SFPD remained highly effective, achieving a 7.54% reduction in ISE over SAC. Although SAC outperformed SFPD in IAE under these conditions (SAC: 1231, SFPD: 1368), SFPD still significantly outperformed traditional PD, achieving improvements of 53.69% in ISE and 51.70% in IAE, further demonstrating its robustness in noisy environments.
Overall, the experimental results demonstrated that SFPD outperformed SAC and traditional PD across various control scenarios. The proposed method achieved faster response, reduced steady-state errors, superior robustness to measurement noise, and enhanced disturbance rejection, making it a more effective solution for nonlinear control applications.

Test with Randomized Initial Conditions

In previous experiments, fixed initial conditions were used to provide a controlled assessment of the proposed method’s performance. To further evaluate its robustness and generalization capability, we then introduced randomized initial conditions [44]. This allowed for a more comprehensive assessment by ensuring the controller could adapt to a wider range of operating conditions and disturbances. Figure 8 presents a comparative analysis of the angular error performance of the SAC, SFPD, and traditional PD controllers under varying initial conditions and noise, highlighting the superior adaptability of the proposed method in handling uncertainties.
Across all 12 angles in Figure 8a, the SFPD (blue curve) consistently demonstrated faster convergence to near-zero angular error compared to the SAC (red curve) and traditional PD (green curve). For most angles, the SFPD algorithm showed reduced oscillations and overshoot during the initial error correction phase, highlighting its stability and precision. SAC displayed comparable performance but generally lagged slightly in terms of convergence speed. Traditional PD exhibited slower response times and significant overshoots, especially for larger initial angles.
Noise introduced noticeable fluctuations in angular error across all controllers as shown in Figure 8b. However, SFPD continued to maintain superior convergence performance with minimal oscillations compared to SAC and traditional PD. SAC showed improved noise tolerance compared to traditional PD, but its fluctuations remained larger than those of SFPD. Traditional PD suffered from larger amplitude oscillations and slower stabilization under noisy conditions, indicating its vulnerability to measurement noise.

4.2.2. Case 2: CSTR System

Training Performance Evaluation

The training process, as shown in Figure 9, involved 50 epochs for both the SFPD and SAC algorithms. The fuzzy membership initialization parameters for SFPD’s FNNs were uniformly distributed within their respective value ranges, with the number of fuzzy memberships set to seven.
From Figure 9, the training process on the CSTR system highlighted the significant advantages of SFPD over SAC. In Figure 9a, the cumulative reward (CR) for SFPD remained stable, consistently approaching 1000, while SAC exhibited substantial fluctuations and slower convergence, revealing instability in its learning process. Regarding policy loss, SFPD quickly minimized the loss and maintained stability, whereas SAC experienced higher losses and a slower decline, emphasizing SFPD’s superior efficiency in policy optimization. As shown in Figure 9b, the Q-network loss for SFPD was consistently low and stable from the outset, while SAC began with high losses that gradually decreased but still exhibited noticeable errors, underscoring SFPD’s accuracy in value function estimation. Similarly, Figure 9c illustrates that SFPD maintained low and stable V-network loss throughout, while SAC exhibited significant initial fluctuations before stabilizing, further demonstrating SFPD’s advantages in state-value function estimation and policy stability. Overall, SFPD outperformed SAC in terms of stability, convergence speed, and loss control, proving to be a more effective solution for complex dynamic system control tasks. Notably, when the exploration space is large, such as with vast action spaces, SAC’s convergence becomes significantly slower or may even fail entirely, whereas the proposed method continues to exhibit robust and reliable convergence performance.

Testing Performance Evaluation

Figure 10 presents the testing results, comparing the performance of the SAC, SFPD, and traditional PD algorithms. To evaluate noise robustness, a Gaussian white noise measurement disturbance signal $\mu \sim N(0, 1^2)$ was introduced, as depicted in Figure 10b. Additionally, to assess disturbance rejection capability, an external disturbance was applied [45] between steps 1000 and 1005, as shown in Figure 10c. Unlike the previous tests, which employed a positional PD controller, this experiment utilized a traditional incremental PD controller. The mathematical formulation of the incremental PD controller is given below:
$$T_c(t+1) = T_c(t) + K_p \times e + K_d \times \dot{e}$$
where $T_c(t+1)$ denotes the control signal at the next time step, while $T_c(t)$ denotes the control signal at the current time step. The parameters for the PD algorithm were determined using the Ziegler–Nichols method, with $K_p = 6.690$ and $K_d = 1.842$.
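A minimal sketch of the incremental PD law in (25) is given below, assuming a finite-difference derivative of the error and clamping the coolant temperature to the 50–420 K range stated earlier; the sampling period is an assumption, while the gains are the Ziegler–Nichols values from the paper.

```python
class IncrementalPD:
    """Incremental PD: T_c(t+1) = T_c(t) + Kp*e + Kd*e_dot, clamped to [u_min, u_max]."""

    def __init__(self, kp, kd, dt, u0, u_min=50.0, u_max=420.0):
        self.kp, self.kd, self.dt = kp, kd, dt
        self.u, self.prev_error = u0, 0.0
        self.u_min, self.u_max = u_min, u_max

    def step(self, error):
        e_dot = (error - self.prev_error) / self.dt   # finite-difference derivative
        self.prev_error = error
        self.u = min(max(self.u + self.kp * error + self.kd * e_dot, self.u_min), self.u_max)
        return self.u

pd = IncrementalPD(kp=6.690, kd=1.842, dt=0.1, u0=50.0)  # dt is an assumption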
From Figure 10, it can be observed that the SAC algorithm struggled with stability, showing significant oscillations. In contrast, both the SFPD and traditional PD algorithms managed to control the system effectively. Figure 10b illustrates that the SFPD algorithm demonstrated higher robustness compared to the traditional PD algorithm. Table 5 provides the quantitative data. From Figure 10c, it is evident that SFPD (red line) demonstrated superior control stability, maintaining the temperature close to the reference value (yellow line) with minimal fluctuations. In contrast, SAC (blue dashed line) exhibited significant instability, with large oscillations and a pronounced peak around step 1000, exceeding 350 °C. Traditional PD (green dashed line) performed better than SAC but still showed noticeable deviations and slower recovery.
From the analysis of the data in Table 5, it can be observed that the SFPD algorithm showed significant improvements in both the absence and presence of Gaussian noise interference. In the absence of noise, the SFPD algorithm showed substantial improvements in ISE across different sampling intervals when compared to the SAC algorithm, with percentage gains of 17.14%, 66.44%, 85.88%, 37.84%, and 65.14%. However, when compared to the traditional PD algorithm, the improvements were less pronounced, with percentage changes of −5.40%, 0%, −0.05%, −7.73%, and −0.84%, indicating some negative improvements in specific cases.
When noise was introduced, the SFPD algorithm’s performance in ISE remained strong relative to the SAC algorithm, achieving improvements of 41.88%, 68.93%, 93.46%, 73.15%, and 82.68%. Comparatively, the improvements over the traditional PD algorithm were more moderate, with percentage gains of 7.94%, 41.00%, 1.24%, 12.35%, and 21.06%.
Similarly, the SFPD algorithm exhibited significant improvements in IAE. Without noise, the percentage improvements over the SAC algorithm were 64.06%, 83.48%, 92.58%, 84.77%, and 88.63%. Compared to the traditional PD algorithm, the gains were more modest, with percentage changes of 12.88%, 15.56%, 0.46%, 2.50%, and 2.40%. Under noisy conditions, the SFPD algorithm continued to outperform the SAC algorithm, with percentage improvements in IAE of 65.69%, 72.73%, 93.05%, 77.99%, and 87.94%. The enhancements over the traditional PD algorithm in this scenario were 32.04%, 48.44%, 33.12%, 42.45%, and 53.46%.
Overall, the SFPD algorithm excelled in both ISE and IAE metrics, particularly in challenging conditions with noise interference. While the advantages of the SFPD algorithm over the traditional PD algorithm were less pronounced in some instances, it required no human intervention and demonstrated greater robustness. The SFPD algorithm consistently outperformed the SAC algorithm, making it a strong candidate for applications where robustness and stability are crucial.

4.3. Complexity Analysis

4.3.1. Space Complexity Analysis

The space complexity of each layer in the model is as follows:
  • For the input layer, the space required to store the input data x, where there are n input features, is O ( n ) .
  • In the fuzzification layer, each input x i computes j fuzzy membership values (since there are j fuzzy membership functions per input). The total space required for these fuzzy membership values is O ( i · j ) .
  • In the multiplication layer (fuzzy inference), we store the fuzzy inference result α for each rule. With m rules, the space complexity is O ( m ) .
  • The normalization layer generates a new vector β j , requiring O ( m ) space to store the values.
  • The defuzzification layer calculates the output y i , and for r outputs, the space required is O ( r ) .
Thus, the overall space complexity is:
O ( n ) + O ( i · j ) + O ( m ) + O ( m ) + O ( r )
Simplified, this becomes:
O ( n + i · j + m + r )
As the system’s input dimensionality $i$ increases, the number of fuzzy rules grows exponentially (since $m = j^i$), leading to a rapid increase in space requirements. Likewise, increasing the number of fuzzy membership functions $j$ makes the rule count grow as $j^i$ (polynomially in $j$, with degree $i$), which is still rapid for multi-input systems. Therefore, as the system complexity grows, the number of fuzzy rules increases significantly, potentially making real-time implementation more challenging due to the high space demands.
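As a rough count for the configuration used in this paper ($n = i = 2$ inputs, $j = 7$ memberships per input, $m = j^i = 49$ rules, $r = 2$ outputs), the stored intermediate values per forward pass amount to

$$2 + (2 \times 7) + 49 + 49 + 2 = 116,$$

which is modest here but grows quickly once $i$ or $j$ increases.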

4.3.2. Time Complexity Analysis

The time complexity of each layer in the model is as follows:
  • In the input layer, applying the tanh function to each input element has a time complexity of O ( 1 ) . For n input features, the overall time complexity is O ( n ) .
  • In the fuzzification layer, we compute j fuzzy membership values for each input x i , and each computation takes O ( 1 ) time. Therefore, for i inputs and j memberships, the total time complexity is O ( i · j ) .
  • The multiplication layer (fuzzy inference) calculates the fuzzy inference result α for each rule by multiplying n fuzzy memberships. The time complexity for each rule is O ( n ) , and for m rules, the total time complexity is O ( n · m ) .
  • The normalization layer involves summing m fuzzy inference results and normalizing each α j , which takes O ( m ) time.
  • The defuzzification layer computes the output y i , with a time complexity of O ( m · r ) , as there are r outputs requiring summation of m values.
Thus, the overall time complexity is:
O ( n ) + O ( i · j ) + O ( n · m ) + O ( m ) + O ( m · r )
Simplified, this becomes:
O ( n · m + i · j + m · r )
Increasing the number of fuzzy membership functions $j$ enlarges the number of fuzzy rules as $m = j^i$, which sharply raises the time complexity, and adding input variables increases it exponentially. For high-dimensional systems, this can result in a significant increase in computational time per step, which could pose a challenge for real-time implementation. Thus, adding more input variables or membership functions could drastically increase the computational load, impacting system performance.
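For the same configuration ($n = i = 2$, $j = 7$, $m = 49$, $r = 2$), the per-step operation count is on the order of

$$2 + (2 \times 7) + (2 \times 49) + 49 + (49 \times 2) = 261,$$

which remains negligible for this two-input controller but scales with $j^i$ as inputs or memberships are added.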

5. Conclusions

This paper introduced a novel SFPD controller that integrated the SAC framework with an FNN-based PD controller, leveraging reinforcement learning to enable automatic fuzzy rule adjustment for optimal PD control. The partial interpretability of FNNs allowed expert knowledge to be incorporated during initialization, facilitating faster convergence early in training. The controller employed stochastic optimization and integrated entropy during training, enhancing exploration efficiency and significantly reducing training time. Experimental results demonstrated that the SFPD controller achieved faster convergence and superior control performance compared to both SAC and traditional PD controllers. In large exploration spaces, SAC often struggles with convergence or fails entirely, whereas the proposed method maintains high performance and robustness. A key challenge in applying SAC to PID control is its inability to incorporate the integral (I) component, as this introduces dependence on past states, violating the Markov property. This limitation reduces the effectiveness of replay buffers, potentially causing instability and slower convergence. Therefore, our future work will focus on finding ways to isolate and compute the integral term separately in order to maintain stability and guarantee convergence. Additionally, addressing limitations in high-noise environments, complex nonlinear relationships, and large action spaces can further enhance the model’s robustness. In noisy environments, excessive disturbances can destabilize training, while complex relationships and large action spaces may not only increase computational complexity but also heighten the risk of overfitting. Furthermore, incorporating type-2 fuzzy control could further enhance adaptability and robustness.

Author Contributions

Q.H.: concept and algorithm proposal, design, simulation, writing—original draft preparation. F.B. and M.B.: guidance, supervision, review, and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Australian Research Council (ARC DP210101682) and the UWA Research Collaboration Award (2023/GR001286).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data analysis for this study can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Y.; Ang, K.H.; Chong, G.C. Patents, software, and hardware for PID control: An overview and analysis of the current art. IEEE Control. Syst. Mag. 2006, 26, 42–54. [Google Scholar]
  2. Tang, K.S.; Man, K.F.; Chen, G.; Kwong, S. An optimal fuzzy PID controller. IEEE Trans. Ind. Electron. 2001, 48, 757–765. [Google Scholar] [CrossRef]
  3. Zhou, X.; Xu, C. Fuzzy-PID-based trajectory tracking for 3WIS robot. In Proceedings of the International Conference on Mechatronic Engineering and Artificial Intelligence (MEAI 2023), Shenyang, China, 15–17 December 2023; SPIE: Bellingham, WA, USA, 2024; Volume 13071, pp. 826–834. [Google Scholar]
  4. Liu, B.; Li, J.; Zhou, X.; Li, X. Design of brushless DC motor simulation action system based on Fuzzy PID. In Proceedings of the Ninth International Symposium on Sensors, Mechatronics, and Automation System (ISSMAS 2023), Nanjing, China, 11–13 August 2023; SPIE: Bellingham, WA, USA, 2024; Volume 12981, pp. 276–280. [Google Scholar]
  5. Ouyang, P.; Acob, J.; Pano, V. PD with sliding mode control for trajectory tracking of robotic system. Robot. Comput. Integr. Manuf. 2014, 30, 189–200. [Google Scholar] [CrossRef]
  6. Chi, R.; Li, H.; Shen, D.; Hou, Z.; Huang, B. Enhanced P-type control: Indirect adaptive learning from set-point updates. IEEE Trans. Autom. Control 2022, 68, 1600–1613. [Google Scholar] [CrossRef]
  7. Carvajal, J.; Chen, G.; Ogmen, H. Fuzzy PID controller: Design, performance evaluation, and stability analysis. Inf. Sci. 2000, 123, 249–270. [Google Scholar] [CrossRef]
  8. Laib, A.; Gharib, M. Design of an Intelligent Cascade Control Scheme Using a Hybrid Adaptive Neuro-Fuzzy PID Controller for the Suppression of Drill String Torsional Vibration. Appl. Sci. 2024, 14, 5225. [Google Scholar] [CrossRef]
  9. İrgan, H.; Menak, R.; Tan, N. A comparative study on PI-PD controller design using stability region centroid methods for unstable, integrating and resonant systems with time delay. Meas. Control 2025, 58, 245–265. [Google Scholar] [CrossRef]
  10. Bhattacharjee, D.; Kim, W.; Chattopadhyay, A.; Waser, R.; Rana, V. Multi-valued and fuzzy logic realization using TaOx memristive devices. Sci. Rep. 2018, 8, 8. [Google Scholar] [CrossRef]
  11. Mudi, R.K.; Pal, N.R. A robust self-tuning scheme for PI-and PD-type fuzzy controllers. IEEE Trans. Fuzzy Syst. 1999, 7, 2–16. [Google Scholar] [CrossRef]
  12. Kumar, P.; Nema, S.; Padhy, P. Design of Fuzzy Logic based PD Controller using cuckoo optimization for inverted pendulum. In Proceedings of the 2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies, Ramanathapuram, India, 8–10 May 2014; IEEE: New York, NY, USA, 2014; pp. 141–146. [Google Scholar]
  13. Zhuang, Q.; Xiao, M.; Tao, B.; Cheng, S. Bifurcations control of IS-LM macroeconomic system via PD controller. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; IEEE: New York, NY, USA, 2021; pp. 4565–4570. [Google Scholar]
  14. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  15. He, J.; Su, S.; Wang, H.; Chen, F.; Yin, B. Online PID Tuning Strategy for Hydraulic Servo Control Systems via SAC-Based Deep Reinforcement Learning. Machines 2023, 11, 593. [Google Scholar] [CrossRef]
  16. Yu, X.; Fan, Y.; Xu, S.; Ou, L. A self-adaptive SAC-PID control approach based on reinforcement learning for mobile robots. Int. J. Robust Nonlinear Control 2022, 32, 9625–9643. [Google Scholar] [CrossRef]
  17. Song, L.; Xu, C.; Hao, L.; Yao, J.; Guo, R. Research on PID parameter tuning and optimization based on SAC-auto for USV path following. J. Mar. Sci. Eng. 2022, 10, 1847. [Google Scholar] [CrossRef]
  18. Neves, D.E.; Ishitani, L.; do Patrocínio Júnior, Z.K.G. Advances and challenges in learning from experience replay. Artif. Intell. Rev. 2024, 58, 54. [Google Scholar] [CrossRef]
  19. Di-Castro, S.; Mannor, S.; Di Castro, D. Analysis of stochastic processes through replay buffers. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: New York, NY, USA, 2022; pp. 5039–5060. [Google Scholar]
  20. Boubertakh, H.; Tadjine, M.; Glorennec, P.Y.; Labiod, S. Tuning fuzzy PD and PI controllers using reinforcement learning. ISA Trans. 2010, 49, 543–551. [Google Scholar] [CrossRef]
  21. Puriel-Gil, G.; Yu, W.; Sossa, H. Reinforcement learning compensation based PD control for inverted pendulum. In Proceedings of the 2018 15th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 5–7 September 2018; IEEE: New York, NY, USA, 2018; pp. 1–6. [Google Scholar]
  22. Gil, G.P.; Yu, W.; Sossa, H. Reinforcement learning compensation based PD control for a double inverted pendulum. IEEE Lat. Am. Trans. 2019, 17, 323–329. [Google Scholar]
  23. Shukur, F.; Mosa, S.J.; Raheem, K.M. Optimization of Fuzzy-PD Control for a 3-DOF Robotics Manipulator Using a Back-Propagation Neural Network. Math. Model. Eng. Probl. 2024, 11, 199. [Google Scholar] [CrossRef]
  24. Reddy, K.H.; Sharma, S.; Prasad, A.; Krishna, S. Neural Network based Fuzzy plus PD Controller for Speed Control in Electric Vehicles. In Proceedings of the 2024 IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), Gwalior, India, 14–16 March 2024; IEEE: New York, NY, USA, 2024; Volume 2, pp. 1–4. [Google Scholar]
  25. McCutcheon, L.; Fallah, S. Adaptive PD Control using Deep Reinforcement Learning for Local-Remote Teleoperation with Stochastic Time Delays. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; IEEE: New York, NY, USA, 2023; pp. 7046–7053. [Google Scholar]
  26. Kwan, H.K.; Cai, Y. A fuzzy neural network and its application to pattern recognition. IEEE Trans. Fuzzy Syst. 1994, 2, 185–193. [Google Scholar] [CrossRef]
  27. Dash, P.; Pradhan, A.; Panda, G. A novel fuzzy neural network based distance relaying scheme. IEEE Trans. Power Deliv. 2000, 15, 902–907. [Google Scholar] [CrossRef]
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; A Bradford Book; MIT Press: Cambridge, MA, USA, 2018; pp. 1–2. [Google Scholar]
  29. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018; pp. 1861–1870. [Google Scholar]
  30. Amari, S.-I. Backpropagation and stochastic gradient descent method. Neurocomputing 1993, 5, 185–196. [Google Scholar] [CrossRef]
  31. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  32. Körpeoğlu, S.G.; Filiz, A.; Yıldız, S.G. AI-driven predictions of mathematical problem-solving beliefs: Fuzzy logic, adaptive neuro-fuzzy inference systems, and artificial neural networks. Appl. Sci. 2025, 15, 494. [Google Scholar] [CrossRef]
  33. Villa-Ávila, E.; Arévalo, P.; Ochoa-Correa, D.; Espinoza, J.L.; Albornoz-Vintimilla, E.; Jurado, F. Improving V2G Systems Performance with Low-Pass Filter and Fuzzy Logic for PV Power Smoothing in Weak Low-Voltage Networks. Appl. Sci. 2025, 15, 1952. [Google Scholar] [CrossRef]
  34. Lu, W.; Liang, J.; Su, H. Research on Energy-Saving Optimization Method and Intelligent Control of Refrigeration Station Equipment Based on Fuzzy Neural Network. Appl. Sci. 2025, 15, 1077. [Google Scholar] [CrossRef]
  35. Juang, C.F.; Chen, T.C.; Cheng, W.Y. Speedup of implementing fuzzy neural networks with high-dimensional inputs through parallel processing on graphic processing units. IEEE Trans. Fuzzy Syst. 2011, 19, 717–728. [Google Scholar] [CrossRef]
  36. Zhang, Y.; Wang, Z.; Tazeddinova, D.; Ebrahimi, F.; Habibi, M.; Safarpour, H. Enhancing active vibration control performances in a smart rotary sandwich thick nanostructure conveying viscous fluid flow by a PD controller. Waves Random Complex Media 2024, 34, 1835–1858. [Google Scholar] [CrossRef]
  37. Yang, R.; Gao, Y.; Wang, H.; Ni, X. Fuzzy neural network PID control used in Individual blade control. Aerospace 2023, 10, 623. [Google Scholar] [CrossRef]
  38. Prado, R.; Melo, J.; Oliveira, J.; Neto, A.D. FPGA based implementation of a Fuzzy Neural Network modular architecture for embedded systems. In Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, QLD, Australia, 10–15 June 2012; IEEE: New York, NY, USA, 2012; pp. 1–7. [Google Scholar]
  39. Anusha, S.; Karpagam, G.; Bhuvaneswarri, E. Comparison of tuning methods of PID controller. Int. J. Manag. Inf. Technol. Eng. 2014, 2, 1–8. [Google Scholar]
  40. Oh, S.K.; Jang, H.J.; Pedrycz, W. Optimized fuzzy PD cascade controller: A comparative analysis and design. Simul. Model. Pract. Theory 2011, 19, 181–195. [Google Scholar] [CrossRef]
  41. Chowdhury, M.A.; Al-Wahaibi, S.S.; Lu, Q. Entropy-maximizing TD3-based reinforcement learning for adaptive PID control of dynamical systems. Comput. Chem. Eng. 2023, 178, 108393. [Google Scholar] [CrossRef]
  42. Chowdhury, M.A.; Lu, Q. A novel entropy-maximizing TD3-based reinforcement learning for automatic PID tuning. In Proceedings of the 2023 American Control Conference (ACC), San Diego, CA, USA, 31 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 2763–2768. [Google Scholar]
  43. Dogru, O.; Velswamy, K.; Ibrahim, F.; Wu, Y.; Sundaramoorthy, A.S.; Huang, B.; Xu, S.; Nixon, M.; Bell, N. Reinforcement learning approach to autonomous PID tuning. Comput. Chem. Eng. 2022, 161, 107760. [Google Scholar] [CrossRef]
  44. Shi, Q.; Lam, H.K.; Xuan, C.; Chen, M. Adaptive neuro-fuzzy PID controller based on twin delayed deep deterministic policy gradient algorithm. Neurocomputing 2020, 402, 183–194. [Google Scholar] [CrossRef]
  45. Shuprajhaa, T.; Sujit, S.K.; Srinivasan, K. Reinforcement learning based adaptive PID controller design for control of linear/nonlinear unstable processes. Appl. Soft Comput. 2022, 128, 109450. [Google Scholar] [CrossRef]
Figure 1. Reinforcement learning framework: interaction between the agent and environment, where s represents the state, r denotes the reward, and a is the action taken by the agent.
Figure 2. Structure and functionality of a fuzzy logic system (FLS), illustrating the process of fuzzification, rule evaluation, and defuzzification.
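To make the fuzzification, rule-evaluation, and defuzzification stages of Figure 2 concrete, the sketch below implements a minimal single-input fuzzy inference step in Python. The membership functions, rule consequents, and universe of discourse are illustrative assumptions, not the ones used in the SFPD controller.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-12),
                                 (c - x) / (c - b + 1e-12)), 0.0)

def fuzzy_step(error):
    # Fuzzification: degrees of membership in 'negative', 'zero', 'positive'.
    mu = {
        "neg":  tri(error, -2.0, -1.0, 0.0),
        "zero": tri(error, -1.0,  0.0, 1.0),
        "pos":  tri(error,  0.0,  1.0, 2.0),
    }
    # Rule evaluation: each rule maps an error label to a singleton consequent
    # (illustrative output centroids).
    centroids = {"neg": -1.0, "zero": 0.0, "pos": 1.0}
    # Defuzzification: weighted average of the fired consequents.
    num = sum(mu[k] * centroids[k] for k in mu)
    den = sum(mu.values()) + 1e-12
    return num / den

print(fuzzy_step(0.4))  # crisp output for a small positive error
```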
Figure 3. Neural network architecture of the fuzzy neural network PD (FNNPD) controller.
Figure 4. Proposed framework for SFPD control: The upper part illustrates the forward propagation process, where inputs are passed through the FNNs to generate PD parameters, which are then used by the PD controller to produce the final control signal a. The lower part depicts the backward propagation process, where the neural networks are trained by resampling tuples (S_t, a, r, S_{t+1}) from the replay buffer.
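For orientation, the following Python sketch mirrors the data flow in Figure 4: the fuzzy neural network actor produces PD gains, the PD law yields the control signal a, and the transition (S_t, a, r, S_{t+1}) is pushed to a replay buffer for SAC-style training. The `fnn_actor.sample_gains` method and the environment API are hypothetical placeholders; the buffer and batch sizes follow Table 1.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=20_000)   # R = 20,000 (Table 1)
BATCH_SIZE = 32                        # B = 32 (Table 1)

def sfpd_step(fnn_actor, env, setpoint, state, prev_error, dt):
    """One interaction step of the loop sketched in Figure 4 (illustrative only)."""
    error = setpoint - state                   # tracking error (scalar state assumed)
    d_error = (error - prev_error) / dt
    # Forward pass: the FNN actor maps (error, d_error) to stochastic PD gains.
    # `sample_gains` is a hypothetical method standing in for the network output.
    kp, kd = fnn_actor.sample_gains(error, d_error)
    a = kp * error + kd * d_error              # PD law produces the control signal a
    next_state, reward, done = env.step(a)     # hypothetical environment API
    replay_buffer.append((state, a, reward, next_state, done))
    if len(replay_buffer) >= BATCH_SIZE:
        batch = random.sample(list(replay_buffer), BATCH_SIZE)
        # SAC-style critic and actor updates would be computed from `batch` here.
    return next_state, error
```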
Figure 5. Pendulum system.
Figure 6. Training process on the pendulum: (a) CR over training epochs. (b) Policy loss over training epochs. (c) Q loss over training epochs. (d) V loss over training epochs.
Figure 7. Testing results on the pendulum: (a) without Gaussian noise; (b) with Gaussian noise; (c) with Gaussian noise and external disturbance.
Figure 8. Random initial positions test on the pendulum: (a) without Gaussian noise; (b) with Gaussian noise.
Figure 9. Training process on CSTR: (a) CR over training epochs. (b) Policy loss over training epochs. (c) Q loss over training epochs. (d) V loss over training epochs.
Figure 10. Testing results on CSTR: (a) without Gaussian noise; (b) with Gaussian noise; (c) with Gaussian noise and external disturbance.
Table 1. Parameters and hyperparameters.

Notation | Description | Case 1: Pendulum | Case 2: CSTR
P | Number of training episodes | 10 | 50
R | Replay buffer size | 20,000 | 20,000
B | Batch size | 32 | 32
A | Action space | [−10, 10] | [50, 420]
- | Learning rate | 0.001 | 0.001
- | Length of steps for testing | 200 | 1200
- | Optimizer | Adam | Adam
- | P/D parameter range | [−100, 100]/[−20, 20] | [−5, 5]/[−2, 2]
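For readers reproducing the experiments, the settings in Table 1 can be collected into a configuration structure such as the sketch below; the field names are ours, not from the paper.

```python
# Illustrative configuration mirroring Table 1 (field names are assumptions).
CONFIG = {
    "pendulum": {
        "episodes": 10,
        "replay_buffer_size": 20_000,
        "batch_size": 32,
        "action_space": (-10.0, 10.0),
        "learning_rate": 1e-3,
        "test_steps": 200,
        "optimizer": "Adam",
        "p_range": (-100.0, 100.0),
        "d_range": (-20.0, 20.0),
    },
    "cstr": {
        "episodes": 50,
        "replay_buffer_size": 20_000,
        "batch_size": 32,
        "action_space": (50.0, 420.0),
        "learning_rate": 1e-3,
        "test_steps": 1200,
        "optimizer": "Adam",
        "p_range": (-5.0, 5.0),
        "d_range": (-2.0, 2.0),
    },
}
```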
Table 2. Parameters for the pendulum.

Notation | Description | Value
g | Gravity acceleration | 10
l | The length of the pendulum | 1.0
m | The mass of the pendulum | 1.0
u | The applied torque | [−10, 10]
θ | The current angle of the pendulum | [−π, π]
θ̇ | The current angular velocity of the pendulum | [−8, 8]
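Table 2 matches the parameterisation of the widely used pendulum swing-up benchmark. Assuming the Gym-style equations of motion (the paper does not reproduce them here), one simulation step can be sketched as follows, with the torque limit widened to [−10, 10] per Table 2 and the integration step dt an assumption.

```python
import numpy as np

# Pendulum parameters from Table 2; dynamics follow the standard Gym-style
# formulation (assumed, not reproduced from the paper). theta = 0 is upright.
g, l, m = 10.0, 1.0, 1.0
dt = 0.05                                       # integration step (assumed)

def pendulum_step(theta, theta_dot, u):
    u = np.clip(u, -10.0, 10.0)                 # torque range from Table 2
    theta_ddot = 3.0 * g / (2.0 * l) * np.sin(theta) + 3.0 / (m * l**2) * u
    theta_dot = np.clip(theta_dot + theta_ddot * dt, -8.0, 8.0)  # velocity limit
    theta = ((theta + theta_dot * dt + np.pi) % (2.0 * np.pi)) - np.pi  # wrap to [-pi, pi)
    return theta, theta_dot
```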
Table 3. Parameters of the CSTR model.

Parameter | Value | Unit | Description
F | 1 | m³/h | Liquid flow rate in the vessel
V | 1 | m³ | Reactor volume
R | 1.985875 | kcal/(kmol·K) | Ideal gas constant
ΔH | −5.960 | kcal/kmol | Heat of chemical reaction
E | 11,843 | kcal/kmol | Activation energy
k0 | 34,930,800 | 1/h | Pre-exponential nonthermal factor
ρCp | 500 | kcal/(m³·K) | Density multiplied by heat capacity
UA | 150 | kcal/(K·h) | Overall heat transfer coefficient multiplied by vessel area
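The parameters in Table 3 correspond to a common two-state exothermic CSTR benchmark. Assuming the usual component mass balance and energy balance, dCA/dt = (F/V)(CAf − CA) − k0·exp(−E/(RT))·CA and dT/dt = (F/V)(Tf − T) − (ΔH/(ρCp))·k0·exp(−E/(RT))·CA + (UA/(ρCp·V))(Tc − T), with the coolant temperature Tc taken as the manipulated input, the model right-hand side can be sketched as below; the equations are an assumption, since the paper's model is not reproduced here.

```python
import numpy as np

# Parameters as listed in Table 3.
F, V = 1.0, 1.0                 # m^3/h, m^3
R = 1.985875                    # kcal/(kmol*K)
dH = -5.960                     # kcal/kmol (value as listed in Table 3)
E = 11_843.0                    # kcal/kmol
k0 = 34_930_800.0               # 1/h
rho_cp = 500.0                  # kcal/(m^3*K)
UA = 150.0                      # kcal/(K*h)

def cstr_rhs(CA, T, Tc, CAf, Tf):
    """Right-hand side of the assumed two-state CSTR model (a sketch)."""
    rate = k0 * np.exp(-E / (R * T)) * CA        # Arrhenius reaction rate
    dCA = F / V * (CAf - CA) - rate              # component mass balance
    dT = (F / V * (Tf - T)
          - dH / rho_cp * rate                   # exothermic heat release
          + UA / (rho_cp * V) * (Tc - T))        # heat exchange with coolant
    return dCA, dT
```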
Table 4. Quantitative metrics of testing process on the pendulum (Note: WGN = with Gaussian noise).

WGN | SFPD ISE | SFPD IAE | SAC ISE | SAC IAE | Traditional PD ISE | Traditional PD IAE
No | 104,046 | 1158 | 341,062 | 3062 | 138,813 | 2659
Yes | 108,167 | 1852 | 351,541 | 3267 | 153,669 | 3399
Table 5. Quantitative metrics of testing results on CSTR (Note: SI = sampling interval, WGN = with Gaussian noise).

SI | WGN | SFPD ISE | SFPD IAE | SAC ISE | SAC IAE | Traditional PD ISE | Traditional PD IAE
0–100 s | No | 1757 | 115 | 2121 | 320 | 1667 | 132
0–100 s | Yes | 1746 | 140 | 3003 | 408 | 1896 | 206
100–200 s | No | 252 | 38 | 751 | 230 | 252 | 45
100–200 s | Yes | 272 | 66 | 875 | 242 | 461 | 128
200–500 s | No | 4351 | 218 | 30,832 | 2939 | 4349 | 219
200–500 s | Yes | 4539 | 309 | 69,403 | 4449 | 4596 | 462
500–800 s | No | 1031 | 78 | 1659 | 512 | 957 | 80
500–800 s | Yes | 1043 | 189 | 3885 | 859 | 1190 | 329
800–1200 s | No | 2644 | 122 | 7584 | 1074 | 2622 | 125
800–1200 s | Yes | 2777 | 242 | 16,036 | 2008 | 3518 | 520
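Tables 4 and 5 report the integral of squared error (ISE) and integral of absolute error (IAE) of the tracking error e(t). A minimal way to compute them from logged errors is sketched below; whether a sampling-interval factor dt is folded in is an assumption, as the tables do not state it.

```python
import numpy as np

def ise_iae(errors, dt=1.0):
    """Integral of squared / absolute tracking error over an evaluation window.
    Including dt is an assumption; Tables 4 and 5 do not specify it."""
    errors = np.asarray(errors, dtype=float)
    ise = float(np.sum(errors**2) * dt)
    iae = float(np.sum(np.abs(errors)) * dt)
    return ise, iae

# Example usage: metrics over a sub-interval, mirroring the 0-100 s row of Table 5.
# ise_0_100, iae_0_100 = ise_iae(errors[idx_0_100], dt)
```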
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
