Article

Adaptive PPO-RND Optimization Within Prescribed Performance Control for High-Precision Motion Platforms

1 School of Integrated Circuits, Shandong University, Jinan 250101, China
2 45th Research Institute of China Electronics Technology Group Corporation, Beijing 100176, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(21), 3439; https://doi.org/10.3390/math13213439
Submission received: 19 September 2025 / Revised: 21 October 2025 / Accepted: 25 October 2025 / Published: 28 October 2025

Abstract

The continuous reduction in critical dimensions and the escalating demands for higher throughput are driving motion platforms to operate under increasingly complex conditions, including multi-axis coupling, structural nonlinearities, and time-varying operational scenarios. These complexities make the trade-offs among precision, speed, and robustness increasingly challenging. Traditional Proportional–Integral–Derivative (PID) controllers, which rely on empirical tuning methods, suffer from prolonged trial-and-error cycles and limited transferability, and consequently struggle to maintain optimal performance under these complex working conditions. This paper proposes an adaptive β–Proximal Policy Optimization with Random Network Distillation (β-PPO-RND) parameter optimization method within the Prescribed Performance Control (PPC) framework. The adaptive coefficient β is updated based on the temporal change in reward difference, which is clipped and smoothly mapped to a preset range using a hyperbolic tangent function. This mechanism dynamically balances intrinsic and extrinsic rewards—encouraging broader exploration in the early stage and emphasizing performance optimization in the later stage. Experimental validation on a Permanent Magnet Linear Synchronous Motor (PMLSM) platform confirms the effectiveness of the proposed approach. It eliminates the need for manual tuning and enables real-time controller parameter adjustment within the PPC framework, achieving high-precision trajectory tracking and a significant reduction in steady-state error. Experimental results show that the proposed method achieves MAE = 0.135 × 10⁻³ m and RMSE = 0.154 × 10⁻³ m, representing approximately 70% reductions compared to the conventional PID controller.

1. Introduction

With the continuous evolution of semiconductor technology, particularly the pursuit of smaller critical dimensions, higher throughput, and more robust processes, precision control technologies for semiconductor equipment are facing unprecedented challenges. Against this backdrop, the control of precision mechanical systems, especially core motion platforms such as lithography and metrology equipment, has become a critical factor driving advancements in semiconductor manufacturing. The accuracy and stability of these motion platforms directly determine the precision and efficiency of the manufacturing process, particularly in achieving sub-micron feature sizes and nanometer-level overlay accuracy. Consequently, optimizing control systems to enhance trajectory-tracking precision and system response speed has emerged as one of the core issues in semiconductor equipment control research [1,2].
Due to its simple structure, clear physical interpretation, and ease of engineering implementation, the PID controller remains a mainstream approach in many precision motion control systems [3,4]. Particularly in early lithography and metrology equipment, PID control was often capable of meeting sub-micron accuracy requirements [5,6,7]. However, as the precision requirements of semiconductor manufacturing continue to increase, traditional PID control has gradually revealed limitations when confronting complex nonlinearities, multi-axis coupling, and time-varying dynamic characteristics [8]. To overcome these shortcomings, many researchers have proposed adaptive PID control strategies integrated with optimization algorithms for high-speed, high-precision motion applications. Liu et al. [9] proposed a fuzzy adaptive PID control for a shipborne 6-degree-of-freedom stabilization platform, which enhanced disturbance suppression in marine environments, reduced the settling time by 50%, and achieved zero overshoot. Tian et al. [10] developed an adaptive fuzzy PID control for a spinal surgical robot, which integrated B-spline curves and force feedback adjustment to improve cutting accuracy and stability. Noordin et al. [11] introduced an adaptive PID control combined with sliding mode control for quadrotor aircraft, which strengthened the system’s robustness and stability in highly dynamic environments. Zhou et al. [12] proposed a genetic algorithm-optimized feedforward PID control strategy, which enhanced the response speed and precision of the XY platform. Aboud et al. [13] integrated particle swarm optimization (PSO) with fractional-order PID control, improving the trajectory accuracy of cable-driven parallel robots (CDPRs). Similarly, Sabo et al. [14] applied PSO and genetic algorithms to tune PID parameters for an automatic voltage regulator system, improving the overall dynamic performance. Ramadhani et al. [15] developed a method for retuning PID parameters in Dynamixel servo motors, which addressed issues of overheating and precision degradation under voltage fluctuations, thereby achieving stable control across varying voltage conditions. Li et al. [16] introduced a BP neural network-based tuning method for PID control, which enhanced wave compensation accuracy in a shipborne Stewart platform and significantly reduced positional deviation. Furthermore, Differential Evolution (DE) and its improved variants have also been applied to controller parameter optimization. Ghasemi et al. [17] proposed a self-competitive mutation strategy for Differential Evolution to enhance global search capability and convergence performance in PID controllers and automatic voltage regulator systems.
Despite progress from various structural improvements, controller parameter tuning still relies heavily on empirical experience and lacks adaptability to time-varying conditions. To address this issue, numerous optimization-based tuning methods have been proposed [18]. Yang et al. [19] combined an artificial fish swarm algorithm with a BP neural network for automated PID tuning, thereby significantly improving the adjustment efficiency and enhancing the dynamic performance of an XY linear motor platform. Azeez et al. [20] employed an enhanced artificial bee colony algorithm to optimize PID parameters, thereby enhancing the robustness of robotic trajectory tracking. Several recent PID variants have been introduced to improve adaptability and nonlinear control capability. Suid and Ahmad [21] developed a sigmoid PID controller tuned by a nonlinear sine–cosine algorithm for automatic voltage regulator systems. Tumari et al. [22] proposed a neuroendocrine PID controller based on a data-driven hormone regulation mechanism optimized by the marine predator’s algorithm. BELBIC-based and fractional-order PID (FOPID) controllers have also been explored to enhance robustness and adaptability [23,24]. However, these methods are predominantly based on offline tuning and fixed structures once deployed. Although lightweight optimization algorithms such as the improved smoothed functional algorithm and norm-limited SPSA have been proposed for control parameter adjustment [25,26], they often exhibit limited convergence stability and adaptability in high-dimensional continuous systems. These inherent limitations have thus motivated the adoption of reinforcement learning for online, state-dependent parameter optimization. With the rapid advancement of artificial intelligence, deep reinforcement learning (RL) has emerged as a promising framework for advanced learning-based techniques to optimize control systems, attracting significant research interest [27,28]. Reinforcement learning learns optimal control strategies through interaction with the environment, demonstrating significant potential, especially in handling complex, dynamic, and nonlinear systems [29,30]. Among these, the Proximal Policy Optimization (PPO) algorithm, as an efficient and stable reinforcement learning method, has attracted significant research interest in the control field [31,32]. Zhou et al. [33] applied PPO to multi-agent adaptive PID parameter learning for a six-axis robotic arm, demonstrating faster training convergence than MA-DDPG/MA-SAC. Their method also outperformed traditional PID in terms of overshoot, settling time, and disturbance rejection performance. Additionally, Zhang et al. [34] proposed a PPO-based AirPilot adaptive PID controller, which exhibited superior convergence, accuracy, response smoothness, and robustness compared to the PX4 PID baseline in both simulations and physical tests. However, the standard PPO algorithm often suffers from insufficient exploration in complex or high-dimensional environments, leading to suboptimal policy convergence. To address these challenges, many studies have introduced Random Network Distillation (RND) as a curiosity-driven exploration module within reinforcement learning algorithms [35]. The RND module enhances the exploration capability of agents by generating intrinsic reward signals, thereby improving the learning efficiency of reinforcement learning [36].
Prescribed Performance Control (PPC) is a control framework notable for its ability to constrain both transient and steady-state errors within predefined bounds [37]. Unlike conventional PID-type controllers, PPC provides a systematic and reliable means of guaranteeing prescribed performance in precision motion control, and it has attracted increasing attention in recent years. Sui and Tong [38] developed a finite-time fuzzy adaptive PPC for nonlinear MIMO systems, which enhanced convergence speed and reduced computational complexity. Han et al. [39] proposed a neural network–based PPC utilizing spectral normalization to improve robustness and transient response under model uncertainties. Nevertheless, studies integrating PPO with RND under the PPC framework remain limited, motivating the present work.
This paper proposes an improved PPO-RND-based controller parameter optimization method aimed at addressing the limitations of traditional controllers in precision motion control, which often rely on manual tuning and utilize static parameters. Unlike traditional methods, the proposed approach optimizes controller parameters within a PPC framework by introducing an adaptive beta value to dynamically balance the weight between extrinsic and intrinsic rewards, thereby enhancing the exploration efficiency of the PPO algorithm and enabling higher-precision trajectory tracking control in dynamic and complex environments. Experiments conducted on a Permanent Magnet Linear Synchronous Motor (PMLSM) platform demonstrate that the proposed method exhibits significant advantages in control precision and stability. Given the widespread application of PMLSM in precision motion control systems, particularly in semiconductor equipment motion platforms, the proposed method demonstrates considerable generalizability and application potential. The main contributions of this paper are summarized as follows:
  • An innovative PPO-RND algorithm is proposed, which integrates Proximal Policy Optimization (PPO) and Random Network Distillation (RND) methods by introducing an adaptive beta parameter to dynamically regulate the balance between exploration and convergence. Traditional PPO algorithms typically employ a fixed strategy for exploration-exploitation trade-offs, whereas the proposed method adaptively adjusts beta to optimize the degree of exploration in real-time based on the system’s learning progress and reward feedback.
  • A dynamically optimized control framework is innovatively proposed by combining Prescribed Performance Control (PPC) with the adaptive beta PPO-RND algorithm, and implemented on a PMLSM platform. Unlike traditional PPC methods, this approach enables real-time optimization of controller parameters while simultaneously managing dynamic system errors. Experimental results indicate that the proposed method significantly improves control accuracy and enhances system stability and robustness in complex environments.

2. Modeling

In this study, a permanent magnet linear synchronous motor (PMLSM) is employed as the drive system for the motion platform. This motor drives the platform via electromagnetic force along the linear direction, ensuring precise control of translational motion. The research focuses on single-degree-of-freedom control of the motion platform to optimize the motor’s movement along the track direction, achieving high-precision linear positioning and trajectory tracking. The PMLSM mathematical model used in this study is based on our previous work [40], which was experimentally validated on a hardware-in-the-loop motion control platform. The main structure of the PMLSM is shown in Figure 1.
The electromagnetic thrust generated by a PMLSM is proportional to the control current and is related to the motor’s thrust constant k t , as expressed in Equation (1).
$$F_e = k_t i_q$$
where F e represents the electromagnetic thrust, and i q denotes the reference current generated by the controller. This thrust acts on the mass of the platform, propelling it along the linear direction via the force supplied by the motor. The dynamic model of the platform, based on Newton’s second law of motion, incorporates the platform’s mass, damping effects, and external disturbances, as expressed in Equation (2).
$$m\ddot{x} + b\dot{x} = F_e + d(t)$$
where $m$ denotes the equivalent mass of the platform, $b$ represents the viscous damping coefficient, $\dot{x}$ and $\ddot{x}$ indicate the velocity and acceleration of the platform, respectively, and $d(t)$ signifies the external disturbance term. This model accounts for the inertial and damping forces of the platform, while the external disturbance primarily serves to simulate potential environmental uncertainties. Given the high bandwidth of the current loop, it is assumed that the current can rapidly track its reference value. Thus, the dynamics of the current loop are neglected and simplified via $F_e = k_t i_q$. To maintain consistency with subsequent implementations, the outer loop is modeled in the force domain, denoted as $\tau(t) = F_e(t) = k_t i_q^{\mathrm{ref}}(t)$, with the inner loop approximated as a static gain. The current constraint $|i_q| \le i_{\max}$ is equivalently expressed as a force constraint $|\tau| \le \tau_{\max} = k_t i_{\max}$.
The corresponding transfer function from the control input i q ( t ) to the position output x ( t ) is derived as (3).
$$G(s) = \frac{k_t}{m s^{2} + b s}$$
This transfer function represents a typical second-order dynamic model of the motion stage. The corresponding state-space representation can be written as Equation (4).
$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = -\frac{b}{m} x_2 + \frac{k_t}{m} i_q, \qquad y = x_1$$
This simplified second-order model captures the dominant motion dynamics and serves as the foundation for the subsequent controller design and performance evaluation. The following section will introduce the PPC architecture based on the adaptive beta PPO-RND algorithm.
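For illustration, the sketch below simulates this second-order model in discrete time with forward Euler integration; the numerical values of m, b, and k_t are placeholders chosen for demonstration, not the identified parameters of the experimental platform.

```python
import numpy as np

# Minimal discrete-time simulation of the second-order stage model
#   m*x_ddot + b*x_dot = k_t*i_q + d(t)
# Parameter values are illustrative placeholders, not identified values.
m, b, kt = 5.0, 2.0, 30.0      # mass [kg], damping [N*s/m], thrust constant [N/A]
dt = 2e-4                      # 5 kHz sampling period [s]

def step(x, v, i_q, d=0.0):
    """Advance (position, velocity) by one sample using forward Euler."""
    a = (kt * i_q + d - b * v) / m   # acceleration from Eq. (2)
    return x + v * dt, v + a * dt

x, v = 0.0, 0.0
for _ in range(5000):              # 1 s of open-loop motion with constant current
    x, v = step(x, v, i_q=0.1)
print(f"position after 1 s: {x:.4f} m, velocity: {v:.4f} m/s")
```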

3. Control Strategy

3.1. Prescribed Performance Control Architecture Based on PPO-RND

This subsection presents the system control architecture, as illustrated in Figure 2. The adaptive β -PPO-RND algorithm is applied to tune the PID controller parameters online within the PPC framework.
During each training episode, the procedure of applying PPO-RND to tune the PPC-PID controller is as follows.
The policy network receives the transformed tracking error together with the plant state variables (such as velocity and acceleration) as the input state and outputs the corresponding PID gains K p , K i , K d . The PPC-PID controller then generates the control signal, drives the plant, and produces the next state and tracking error. A composite reward is constructed by combining the external performance reward with the intrinsic exploration reward generated by the RND module. The intrinsic reward from RND is directly fed into the PPO objective to encourage exploration in regions where the prediction error is high, while the external reward guides the controller toward precise tracking performance. PPO jointly optimizes the policy and value networks based on this composite reward, thereby achieving a balance between exploration and control accuracy. Meanwhile, the RND predictor network is updated in parallel with PPO to continuously adjust the intrinsic reward signal according to the agent’s learning progress. Through iterative interaction and gradient updates, the PPO-RND framework converges to an optimal policy that adaptively tunes K p , K i , K d in real time, ensuring high-precision tracking under PPC constraints.
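The per-episode data flow described above can be summarized as the structural sketch below; every callable it receives (environment reset and step, actor, intrinsic reward, and the PPO/RND update routines) is a hypothetical placeholder for the corresponding module, not the implementation used in this work.

```python
from typing import Callable, Sequence, Tuple

# Structural sketch of one PPO-RND training episode for PPC-PID gain tuning.
# All callables are hypothetical placeholders illustrating the data flow only.
def run_episode(
    reset: Callable[[], Sequence[float]],
    env_step: Callable[[Tuple[float, float, float]], Tuple[Sequence[float], float, bool]],
    actor: Callable[[Sequence[float]], Tuple[float, float, float]],
    intrinsic_reward: Callable[[Sequence[float]], float],
    ppo_update: Callable[[list], None],
    rnd_update: Callable[[list], None],
    beta: float,
) -> None:
    state = reset()                                   # transformed error + plant states
    batch, states = [], []
    done = False
    while not done:
        gains = actor(state)                          # policy outputs (Kp, Ki, Kd)
        next_state, r_ext, done = env_step(gains)     # PPC-PID drives the plant
        r_total = r_ext + beta * intrinsic_reward(next_state)  # composite reward
        batch.append((state, gains, r_total, next_state))
        states.append(next_state)
        state = next_state
    ppo_update(batch)                                 # clipped-surrogate policy/value update
    rnd_update(states)                                # predictor trained in parallel with PPO
```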

3.2. The Structure of Prescribed Performance Control

The core objective of PPC is to ensure that the system output error strictly satisfies predefined performance constraints throughout the entire control process. It explicitly defines the convergence rate, steady-state error, and the maximum overshoot range of the error, thereby guaranteeing that the dynamic process conforms to specific performance requirements.
Prescribed performance is defined by a continuous, strictly positive performance function ρ ( t ) [41], which characterizes the convergence behavior of the allowable error upper bound over time, as expressed in Equation (5).
$$\rho(t) = (\rho_0 - \rho_\infty) e^{-\lambda t} + \rho_\infty$$
where $\rho_0 > \rho_\infty > 0$ represent the initial and steady-state error bounds, respectively, and $\lambda > 0$ determines the convergence rate of the error bounds. The normalized error variable is defined by Equation (6), and the transformation that maps the original error $e(t)$ into an unconstrained domain is given by Equation (7).
$$z(t) = e(t) / \rho(t)$$
$$\varepsilon(t) = T(z(t)) = \begin{cases} \ln \dfrac{\delta + z(t)}{\delta\,(1 - z(t))}, & e(0) \ge 0 \\[2mm] \ln \dfrac{\delta\,(1 + z(t))}{\delta - z(t)}, & e(0) < 0 \end{cases}$$
Let $T(\cdot)$ be a homeomorphic mapping function, and $\varepsilon(t) \in (-\infty, +\infty)$ be the mapped unconstrained variable. The transformed error $\varepsilon(t)$ is adopted as the feedback quantity for the PID controller, and the control input $u(t)$ is defined by Equation (8).
$$u(t) = K_p \varepsilon(t) + K_i \int_0^{t} \varepsilon(\tau)\, d\tau + K_d \dot{\varepsilon}(t)$$
where K p , K i and K d represent the proportional gain, integral gain, and derivative gain of the PID controller, respectively.
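A minimal sketch of the transformation in Equations (5)–(8) is given below; the bound parameters and the truncation constant are illustrative values, and the integral and derivative of the transformed error are assumed to be maintained by the caller.

```python
import numpy as np

# Sketch of the prescribed-performance transformation (Eqs. (5)-(8)).
# Parameter values are illustrative, not the tuned values of Section 5.
rho0, rho_inf, lam, delta = 5e-3, 1e-4, 2.0, 0.5

def rho(t):
    """Exponentially decaying performance bound, Eq. (5)."""
    return (rho0 - rho_inf) * np.exp(-lam * t) + rho_inf

def transform(e, t, e0_nonneg=True, eps=1e-6):
    """Map the constrained error e(t) to the unconstrained variable, Eqs. (6)-(7)."""
    z = e / rho(t)                                   # normalized error, Eq. (6)
    if e0_nonneg:                                    # branch domain z in (-delta, 1)
        z = np.clip(z, -delta + eps, 1 - eps)
        return np.log((delta + z) / (delta * (1 - z)))
    z = np.clip(z, -1 + eps, delta - eps)            # branch domain z in (-1, delta)
    return np.log((delta * (1 + z)) / (delta - z))

def pid_on_transformed_error(eps_t, eps_int, eps_dot, kp, ki, kd):
    """PID law acting on the transformed error, Eq. (8)."""
    return kp * eps_t + ki * eps_int + kd * eps_dot
```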

3.3. Proximal Policy Optimization

The Actor-Critic architecture serves as the core of the stable training of PPO: the Actor learns the policy distribution to balance exploration and exploitation, while the Critic provides value estimation to suppress gradient variance and guide more robust updates.
Building upon this foundation, PPO ensures training stability by controlling the magnitude of policy distribution updates, thereby preventing excessive policy shifts. The network architectures of the Actor-Critic are illustrated in Figure 3. The objective function for policy optimization is given by Equation (9).
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$$
where θ denotes the policy parameters, E ^ t [ ] represents the expectation over empirical data at time step t, and A ^ t is the Generalized Advantage Estimation (GAE), which quantifies the advantage of the current policy over a baseline policy for a given state-action pair. The policy probability ratio r t ( θ ) is defined as the ratio of the probabilities under the new and old policies, as given by Equation (10):
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$
where $\pi_\theta(a_t \mid s_t)$ represents the probability of the new policy selecting the action $a_t$ in state $s_t$, and $\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ denotes the corresponding probability under the old policy. The function $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ refers to the clipping operation, which constrains the policy ratio to the interval $[1-\epsilon, 1+\epsilon]$ to prevent excessive deviation in policy updates. The Generalized Advantage Estimation (GAE) is defined by Equation (11).
$$A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}$$
where γ denotes the discount factor, and λ is the GAE coefficient, which adjusts the trade-off between long-term and short-term rewards. The TD error δ t + l is expressed by Equation (12).
$$\delta_{t+l} = r_{t+l} + \gamma V_\phi(s_{t+l+1}) - V_\phi(s_{t+l})$$
V ϕ s t represents the estimated state-value function, and ϕ denotes the parameters of the value function network. The policy parameters of the PPO algorithm are updated using Equation (13).
$$\theta \leftarrow \arg\max_{\theta}\ \hat{\mathbb{E}}\left[ L^{CLIP}(\theta) \right]$$
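The clipped surrogate of Equation (9) and the advantage recursion of Equations (11) and (12) can be written compactly as below; this NumPy sketch only evaluates the quantities, whereas a trainable implementation would compute them inside an automatic-differentiation framework.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation, Eqs. (11)-(12).
    values must contain one more entry than rewards (bootstrap value)."""
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error, Eq. (12)
        last = delta + gamma * lam * last                        # discounted sum, Eq. (11)
        adv[t] = last
    return adv

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of Eq. (9), to be maximized over theta."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # r_t(theta), Eq. (10)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()
```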

3.4. Random Network Distillation

The Random Network Distillation (RND) algorithm is a curiosity-driven unsupervised exploration method. It generates additional intrinsic reward by assessing the novelty of states encountered by the agent within the environment, thereby effectively guiding the agent to explore unknown state spaces. RND comprises two deep neural networks: a randomly initialized and fixed target network, and a predictor network that is continuously updated through training, as illustrated in Figure 4.
For a given state s t , the intrinsic reward r t int is defined as the Mean Squared Error (MSE) between the outputs of the target network and the predictor network, as given by Equation (14).
$$r_t^{\mathrm{int}}(s_t) = \left\| \hat{f}_\psi(s_t) - f(s_t) \right\|_2^2$$
where f s t denotes the random representation of the state s t generated by the target network, and f ^ ψ s t represents the learned representation of the same state produced by the predictor network, with its parameters ψ being updated over time. The predictor network is trained by optimizing the loss function defined in Equation (15).
$$L_{RND}(\psi) = \mathbb{E}_{s_t \sim D}\left[ \left\| \hat{f}_\psi(s_t) - f(s_t) \right\|_2^2 \right]$$
where D denotes the experience replay buffer for states, and ψ represents the parameters of the predictor network. By minimizing this loss function, the predictor network continuously improves its ability to approximate the output of the target network, thereby dynamically adapting the recognition of novelty in environmental states. In practical applications, the intrinsic reward is typically combined with the extrinsic reward r t e x t s t , a t to guide policy updates, as given by Equation (16).
$$r_t^{\mathrm{total}}(s_t, a_t) = r_t^{\mathrm{ext}}(s_t, a_t) + \beta\, r_t^{\mathrm{int}}(s_t)$$
where β denotes the weighting factor for the intrinsic reward, which balances the intrinsic and extrinsic rewards. In this study, β is no longer a constant but is adaptively adjusted based on the extent of exploration. A dimensionless exploration metric is first constructed using the reward difference, as given by Equation (17).
$$\xi_t = \mathrm{clip}\!\left( \frac{R_t - R_{t-1}}{C},\ 0,\ 1 \right)$$
The variable $\xi_t$ is then smoothly mapped to the preset interval $[\beta_{\min}, \beta_{\max}]$ using a hyperbolic tangent function, as given by Equation (18).
$$\beta_t = \beta_{\min} + (\beta_{\max} - \beta_{\min})\left( \tanh \xi_t + 1 \right)$$
where C represents the normalization scale factor. This design automatically increases the weight of intrinsic reward to enhance exploration when extrinsic reward undergoes dramatic changes, and reduces the weight of intrinsic reward to promote behavioral convergence when the reward stabilizes.
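The sketch below illustrates the RND prediction error of Equations (14) and (15) with single-layer random features, together with one possible realization of the adaptive weighting described in Equations (17) and (18), clipping the reward change and squashing it into the preset interval with a hyperbolic tangent. The layer sizes and the values of β_min, β_max, and C are assumptions for demonstration, and the exact scaling of Equation (18) may differ from this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM = 6, 16

# Randomly initialized, fixed target network and a trainable predictor (Eqs. (14)-(15)).
# A single linear layer per network is used purely for illustration.
W_target = rng.normal(size=(FEAT_DIM, STATE_DIM))          # frozen target
W_pred = rng.normal(size=(FEAT_DIM, STATE_DIM)) * 0.1      # trained predictor

def intrinsic_reward(s):
    """Prediction error between target and predictor features, Eq. (14)."""
    return float(np.sum((W_pred @ s - W_target @ s) ** 2))

def update_predictor(s, lr=1e-3):
    """One MSE gradient step on the predictor for state s, Eq. (15)."""
    global W_pred
    err = W_pred @ s - W_target @ s                         # feature error
    W_pred -= lr * 2.0 * np.outer(err, s)                   # d/dW ||W s - f(s)||^2

# Adaptive intrinsic-reward weight: one way to realize the clip + tanh mapping
# of Eqs. (17)-(18); beta_min, beta_max, C are illustrative values.
beta_min, beta_max, C = 0.1, 1.0, 10.0

def adaptive_beta(R_t, R_prev):
    xi = np.clip((R_t - R_prev) / C, 0.0, 1.0)              # Eq. (17)
    return beta_min + (beta_max - beta_min) * np.tanh(xi)   # squash into preset range
```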

4. Stability Analysis

Consider the nominal model of the PMLSM given by Equation (19).
$$m\ddot{x}(t) + b\dot{x}(t) = u(t) + d(t), \qquad m > 0,\ b \ge 0,\ |d(t)| \le \bar{d}$$
The desired trajectory $y_d(t)$ and its derivatives are bounded. The tracking error is defined as $e(t) = y_d(t) - x(t)$. To explicitly constrain overshoot, convergence rate, and steady-state error over the entire time domain, a prescribed performance boundary is given by Equation (20).
$$\rho(t) = (\rho_0 - \rho_\infty) e^{-\lambda t} + \rho_\infty, \qquad \rho_0 > e(0) > -\rho_0$$
The transformed error is then subjected to a monotonic bijective mapping, as expressed in Equations (21) and (22).
$$\gamma(t) = \frac{e(t)}{\rho(t)} \in (-1,\ 1)$$
$$\xi(t) = \Phi(\gamma) = \ln \frac{\delta_1 + \gamma}{\delta_1 (1 - \gamma)}, \qquad \delta_1 > 0$$
The implementation employs numerical truncation via $|\gamma| \le 1 - \varepsilon$ ($0 < \varepsilon \ll 1$) to ensure the boundedness of $\Phi$, $\Phi'$ and the associated gains. Equations (23) and (24) are derived through the chain rule.
$$\dot{\xi} = \psi(\gamma, t)\left( \dot{e} - \frac{\dot{\rho}}{\rho} e \right)$$
$$\psi(\gamma, t) = \frac{\Phi'(\gamma)}{\rho(t)}, \qquad 0 < \underline{\psi} \le \psi(\gamma, t) \le \bar{\psi} < \infty$$
Using a PID control law in the $\xi$ coordinates, the control input is defined as Equation (25), where $I_\xi(t)$ is given by Equation (26),
$$\tau(t) = \mathrm{sat}\!\left( K_p(s)\, \xi(t) + K_i(s)\, I_\xi(t) + K_d(s)\, \dot{\xi}(t) \right)$$
$$I_\xi(t) = \int_0^{t} \xi(\tau)\, d\tau$$
During the deployment phase, the policy network (Actor) is frozen, and only its mean branch is retained for inference. The output PID gains $K_p$, $K_i$, $K_d$ are bounded within their defined domains and satisfy Lipschitz continuity with respect to the state $s$, as formally expressed in Equation (27).
$$K(s) \in [K_{\min},\ K_{\max}], \qquad \| K(s_1) - K(s_2) \| \le L_K \| s_1 - s_2 \|$$
The selected smooth saturation operator $\tau = \tau_{\max}\,\frac{2}{\pi}\arctan\!\left(\frac{\pi}{2}\,\frac{\tau_{\mathrm{raw}}}{\tau_{\max}}\right)$ is globally Lipschitz and sector-bounded, with smooth clipping $|\tau(t)| \le \tau_{\max}$. The integration channel incorporates clipping, thereby ensuring $|I_\xi(t)| \le \bar{I}$. Under these regularity conditions, a quadratic Lyapunov function given by Equation (28) is adopted.
$$V(\xi, \dot{\xi}, I_\xi) = \frac{1}{2}\xi^2 + \frac{\alpha}{2}\dot{\xi}^2 + \frac{\beta}{2} I_\xi^2, \qquad \alpha > 0,\ \beta > 0$$
By differentiating along the closed-loop trajectory and utilizing the plant dynamics $m\ddot{x}(t) + b\dot{x}(t) = \tau + d$ and $\ddot{e} = \ddot{y}_d - \frac{1}{m}(\tau + d) + \frac{b}{m}\dot{x}$, then eliminating the second-order terms through the linear relationships of $\dot{\xi}$ and $\ddot{\xi}$ with respect to $\dot{e}$ and $\ddot{e}$, $\dot{V}$ can be expressed as a quadratic form in $(\xi, \dot{\xi}, I_\xi)$, linearly combined with the bounded external excitation $(\dot{y}_d, \ddot{y}_d, d)$. Since $\mathrm{sat}(\cdot)$ is sector-bounded, $K_p, K_i, K_d$ are bounded and smooth, $\psi$ and $\dot{\psi}$ are bounded, and $|I_\xi(t)| \le \bar{I}$, appropriate $\alpha, \beta$ can be selected to render the dominant quadratic form negative definite, thereby ensuring the existence of constants $c_1, c_2 > 0$ such that Equation (29) holds.
$$\dot{V} \le -c_1\left( \xi^2 + \dot{\xi}^2 + I_\xi^2 \right) + c_2\left( \ddot{y}_d^2 + \dot{y}_d^2 + d^2 \right)$$
Based on the comparison lemma, $(\xi, \dot{\xi}, I_\xi)$ is uniformly ultimately bounded (UUB). Furthermore, since $\Phi$ is a strictly increasing bijection, the boundedness of $\xi$ is equivalent to $|\gamma| < 1$, thus yielding Equation (30).
$$|e(t)| < \rho(t) \quad \forall t \ge 0, \qquad \limsup_{t \to \infty} |e(t)| \le \rho_\infty$$
This demonstrates that under the conditions of post-RL training (with the Actor frozen) and satisfaction of the aforementioned implementation constraints—bounded gains, smooth saturation operators, integration channel clipping, and transformed error truncation—the closed-loop system strictly adheres to the prescribed performance funnel defined by PPC, achieving practical stability and disturbance robustness. If the viscous damping is neglected ( b = 0 ), the above derivations remain valid as a special case, as b > 0 only introduces an additional passive dissipation term that enhances the margin of negative definiteness.
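For reference, the smooth arctan saturation operator used in this analysis can be sketched as below; τ_max = 300 N follows the limit reported in Section 5.1, while the sample inputs are arbitrary.

```python
import numpy as np

# Smooth saturation operator: tau = tau_max * (2/pi) * arctan((pi/2) * tau_raw / tau_max).
tau_max = 300.0   # saturation limit from Section 5.1 [N]

def smooth_sat(tau_raw):
    """Globally Lipschitz, sector-bounded saturation with |tau| <= tau_max."""
    return tau_max * (2.0 / np.pi) * np.arctan((np.pi / 2.0) * tau_raw / tau_max)

# Nearly linear for small inputs; approaches +/- tau_max for large inputs.
print(smooth_sat(50.0), smooth_sat(5000.0))   # ~48.9 N and ~292.7 N
```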

5. Experiment

5.1. Experiment Platform

To validate the real-time control performance of the proposed method, the same hardware-in-the-loop (HIL) real-time simulation platform as in ref. [40] is employed. In this setup, the control algorithm runs on the real-time simulator, with communication and delay simulated within the loop, while the controlled plant is an actual PMLSM precision motion platform, as shown in Figure 5. The platform consists of a host monitoring computer, a real-time simulator, an FPGA interface module, analog output modules, a precision driver, and a motion stage. The host computer and real-time simulator exchange control commands and sensor data efficiently via high-speed IP communication. The real-time simulator executes algorithms and manages real-time control, converting digital control signals into analog current signals through the analog output modules. The precision driver subsequently regulates the motion of the platform based on these signals. High-resolution optical encoder position data is preprocessed by the FPGA and fed back to the real-time simulator, forming a closed-loop control system. The platform demonstrates efficient real-time data processing and low-latency communication capabilities, with the controller sampling frequency set to 5 kHz. Detailed hardware specifications are summarized in Table 1.
The reference trajectory adopts a single-frequency sinusoidal function y d ( t ) = A sin ( 2 π f t ) , where the amplitude is A = 0.02   m and the frequency is f = 10   Hz .
The hyperparameters of the PPO and RND modules were determined through systematic tuning to ensure stable convergence and optimal tracking performance. Within the PPC performance boundary, $\rho_0 = 5.0 \times 10^{-3}$ m, $\rho_\infty = 1.0 \times 10^{-4}$ m, and $K = 2$ s$^{-1}$, the parameter $\delta_1 = 0.5$ is selected, and a slight truncation is applied to $\gamma$ in the implementation as $\gamma \in [-\delta_1 + \varepsilon, 1 - \varepsilon]$ ($\varepsilon = 10^{-6}$). The value $\delta_2 = 1$ is adopted as the upper boundary constant in the implementation. The Actor’s action vector $\mathbf{a} = [a_p, a_i, a_d]^{T} \in [0.5, 1.0]^3$ directly scales the baseline gains to generate the physical gains as $K_p = a_p K_p^{\mathrm{init}}$, $K_i = a_i K_i^{\mathrm{init}}$, $K_d = a_d K_d^{\mathrm{init}}$, where $K_p^{\mathrm{init}} = 1.5 \times 10^{5}$ N, $K_i^{\mathrm{init}} = 1.5 \times 10^{3}$ N/s, and $K_d^{\mathrm{init}} = 7.5 \times 10^{3}$ N·s. A smooth saturation function is applied at the execution stage to constrain the magnitude of the actual control force $\tau$, with a saturation upper limit of $\tau_{\max} = 300$ N. This limit is derived from the motor current limit via $\tau_{\max} = k_t i_{\max}$. During the deployment and evaluation phase, only the mean head output of the Actor is utilized (without sampling). The Adam optimizer is employed, with the Actor’s learning rate set to $5 \times 10^{-4}$, L2 regularization factor to $1 \times 10^{-5}$, and gradient clipping to 1; the Critic’s learning rate is set to $2 \times 10^{-3}$, L2 regularization factor to $2 \times 10^{-4}$, and gradient clipping to 1. In PPO, the discount factor $\gamma$ is set to 0.99, the GAE factor $\lambda$ to 0.95, the clipping factor to 0.2, the entropy coefficient to 0.01, the rollout horizon to 512, the minibatch size to 1024, and the maximum number of training epochs to 100. For RND, the predictor learning rate is set to $1 \times 10^{-3}$.
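For convenience, the reported settings can be gathered into a single configuration structure; the sketch below merely restates the values above, and the grouping itself is illustrative rather than the authors' code.

```python
# Configuration collecting the hyperparameters reported above (values restated
# from the text; the structure and key names are illustrative only).
CONFIG = {
    "ppc":   {"rho0": 5.0e-3, "rho_inf": 1.0e-4, "K": 2.0, "delta1": 0.5, "eps": 1e-6},
    "gains": {"kp_init": 1.5e5, "ki_init": 1.5e3, "kd_init": 7.5e3,
              "action_range": (0.5, 1.0)},
    "actuation": {"tau_max": 300.0},   # [N], derived from the motor current limit
    "actor":  {"lr": 5e-4, "l2": 1e-5, "grad_clip": 1.0},
    "critic": {"lr": 2e-3, "l2": 2e-4, "grad_clip": 1.0},
    "ppo":    {"gamma": 0.99, "gae_lambda": 0.95, "clip": 0.2, "entropy": 0.01,
               "rollout": 512, "minibatch": 1024, "epochs": 100},
    "rnd":    {"predictor_lr": 1e-3},
}
```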
The overall signal flow and real-time computation process of the proposed control system are depicted in Figure 6. The dashed box encompasses the controller computing section, which sequentially executes A/D sampling, state computation, PPO-RND parameter inference, PPC-PID control law calculation, and D/A output within the 0.2 ms sampling period. The driver power amplification and motor dynamic response occur outside this digital control loop.

5.2. Reward

The reward function serves as the most critical evaluation metric in reinforcement learning algorithms, directly determining the training’s objective orientation and convergence speed. This study adopts a combined reward approach, incorporating both extrinsic and intrinsic rewards. The absolute value of the system’s current position error is utilized as the foundational metric for the extrinsic reward. Denoting the current position error as $e_{\mathrm{pos}}(t) = y_d(t) - x(t)$, the extrinsic reward is defined as the negative absolute value of the error, as given by Equation (31).
$$R_{\mathrm{ext}}(t) = -\left| e_{\mathrm{pos}}(t) \right|$$
A reward value closer to zero indicates superior controller performance and higher tracking accuracy. To enhance early-stage exploration efficiency and improve convergence under complex dynamic conditions, the RND-based intrinsic reward described in Section 3.4 is introduced, incentivizing the agent to explore novel states more extensively within the state space and strengthening the agent’s proactive exploration capability. The composite reward ultimately used for training is given by Equation (32).
$$R_t = R_{\mathrm{ext}}(t) + \beta_t R_{\mathrm{int}}(t)$$
where $\beta_t$ denotes the adaptive weighting factor for the intrinsic reward, which is dynamically adjusted based on exploration metrics derived from reward variations, thereby balancing the trade-off between intrinsic exploration and extrinsic objectives.
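A minimal sketch of the composite reward in Equations (31) and (32) is given below, assuming the intrinsic reward and the adaptive weight β_t are supplied by the RND module and Equation (18), respectively.

```python
# Composite reward of Eqs. (31)-(32): negative absolute tracking error plus the
# adaptively weighted intrinsic term. r_int and beta_t are assumed to be
# provided by the RND module and the adaptive-beta mapping, respectively.
def composite_reward(y_d, x, r_int, beta_t):
    r_ext = -abs(y_d - x)          # Eq. (31): closer to zero means better tracking
    return r_ext + beta_t * r_int  # Eq. (32)
```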

5.3. Experiment Results

This section comprises two experimental components:
  • First, the training reward curves of different methods are analyzed to evaluate convergence speed and stability.
  • Second, control error metrics of various methods under identical reference trajectories are compared to validate the tracking performance of the proposed approach.
All experiments are conducted under consistent simulation configurations and trajectory conditions. During the deployment phase, methods incorporating PPO utilize the mean output of the Actor network for inference.

5.3.1. Reward Curves

To evaluate the impact of different designs on the policy learning process, this section compares the average reward curves during the training phase across four PPO-based methods: PPO-PID, PPO-RND-PID, PPO-PPC-PID, and PPO-RND-PPC-PID. The training environment, reference trajectory, network architecture, and parameters remain consistent across all methods. The mean reward per episode is calculated as the “average reward” to characterize convergence speed and stability. To reduce visual noise, the curves are smoothed using a moving average with a window size of 10. Given the differences in reward magnitudes among the methods, Figure 7a displays the average reward curves for PPO-PID and PPO-RND-PID, while Figure 7b presents those for PPO-PPC-PID and PPO-RND-PPC-PID, to avoid scale compression and interpretation bias caused by dual-axis representation.

5.3.2. System Performance

To quantify the tracking performance of the motion platform, this paper adopts four error-based metrics: (1) the Mean Absolute Error (MAE), (2) the Root Mean Square Error (RMSE), (3) the 95th percentile of the absolute error $P_{95}(|e|)$, and (4) the Steady-State Per-Cycle Root-Mean-Square Error (Per-cycle RMS (SS)).
Under discrete sampling conditions, the error at the $k$-th time step is denoted as $e_k$, with the total number of samples being $N$. The evaluation metrics are defined as follows:
  • The Mean Absolute Error (MAE) metric reflects the average deviation of the system throughout its entire operation, intuitively gauging the overall tracking accuracy, as given by Equation (33).
    $$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| e_k \right|$$
  • The Root Mean Square Error (RMSE), which is more sensitive to large-magnitude errors, reflects the system’s control capability under extreme deviation conditions, as given by Equation (34).
    $$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{k=1}^{N} e_k^2 }$$
  • The metric $P_{95}(|e|)$ represents an upper error bound at a high confidence level. It quantifies the maximum error within which the controller operates for the vast majority (95%) of the time, reflecting the worst-case precision of the system under normal operating conditions. A smaller $P_{95}(|e|)$ indicates that most of the time the controller maintains a low tracking deviation, as given by Equation (35).
    $$P_{95}(|e|) = \mathrm{percentile}\left( |e(t)|,\ 95\% \right)$$
  • The steady-state per-cycle root-mean-square error evaluates the periodic tracking consistency of the system during steady operation. For a reference trajectory with period T, the RMS error of the k-th cycle is computed as given by Equation (36).
    $$E_{\mathrm{RMS}}^{k} = \sqrt{ \frac{1}{T} \int_{(k-1)T}^{kT} e^2(t)\, dt }$$
    and the steady-state performance is expressed as the mean and standard deviation of the last $K_{\mathrm{ss}}$ cycles, as given by Equations (37) and (38).
    $$\bar{E}_{\mathrm{RMS}}^{(\mathrm{SS})} = \frac{1}{K_{\mathrm{ss}}} \sum_{k=K-K_{\mathrm{ss}}+1}^{K} E_{\mathrm{RMS}}^{k}$$
    $$\sigma_{E_{\mathrm{RMS}}}^{(\mathrm{SS})} = \sqrt{ \frac{1}{K_{\mathrm{ss}} - 1} \sum_{k=K-K_{\mathrm{ss}}+1}^{K} \left( E_{\mathrm{RMS}}^{k} - \bar{E}_{\mathrm{RMS}}^{(\mathrm{SS})} \right)^2 }$$
    where $T$ is the period of the reference trajectory, $E_{\mathrm{RMS}}^{k}$ denotes the RMS error in the $k$-th cycle, and $K_{\mathrm{ss}}$ is the number of cycles used for steady-state evaluation. A smaller $\bar{E}_{\mathrm{RMS}}^{(\mathrm{SS})}$ and $\sigma_{E_{\mathrm{RMS}}}^{(\mathrm{SS})}$ indicate better steady-state precision and cycle-to-cycle stability.
Under identical reference trajectories and simulation conditions, a comparative analysis of closed-loop tracking performance was conducted across six methods (PID, PPC-PID, PPO-PID, PPO-PPC-PID, PPO-RND-PID, PPO-RND-PPC-PID), with a focus on evaluating their steady-state error. To provide an intuitive visual comparison, Figure 8 contrasts the tracked trajectories of the basic PID controller and the proposed PPO-RND-PPC-PID controller against the reference path, clearly illustrating the improvement in tracking accuracy. The control error trajectories over time are illustrated in Figure 9, while the error statistics within the steady-state phase are summarized in Table 2.
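The sketch below shows how these metrics can be computed from a sampled error sequence; the sampling rate, reference period, and number of steady-state cycles are assumed to be supplied by the caller.

```python
import numpy as np

# Evaluation metrics of Eqs. (33)-(38) for a sampled error sequence e.
def tracking_metrics(e, fs, T, K_ss=10):
    e = np.asarray(e, dtype=float)
    mae = np.mean(np.abs(e))                        # Eq. (33)
    rmse = np.sqrt(np.mean(e ** 2))                 # Eq. (34)
    p95 = np.percentile(np.abs(e), 95)              # Eq. (35)
    n_per_cycle = int(round(T * fs))                # samples per reference period
    n_cycles = len(e) // n_per_cycle
    per_cycle = [np.sqrt(np.mean(e[k * n_per_cycle:(k + 1) * n_per_cycle] ** 2))
                 for k in range(n_cycles)]          # Eq. (36)
    ss = per_cycle[-K_ss:]                          # last K_ss cycles
    return mae, rmse, p95, np.mean(ss), np.std(ss, ddof=1)   # Eqs. (37)-(38)
```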
The results demonstrate that the proposed PPO-RND-PPC-PID method achieves optimal error performance across all scenarios, with MAE = 0.135 and RMSE = 0.154. Compared to the traditional PID controller, it reduces MAE and RMSE by approximately 70.7% and 69.9%, respectively. Relative to the PPC-PID method (which only introduces PPC), it reduces errors by about 11.8% (MAE) and 10.5% (RMSE). Even when compared to PPO-PPC-PID, it achieves reductions of 10.6% in MAE and 9.4% in RMSE. These improvements indicate that, within the PPC performance constraint framework, the integration of PPO’s parameter adaptation and RND’s robust exploration capability achieves a superior balance between rapid response and steady-state error. This synergy significantly reduces overall error levels, thereby enhancing tracking precision and stability. Moreover, the 95th-percentile absolute error P95 decreases from 1.687 (PID) and 3.151 (PPO-PPC-PID) to 0.844, while the steady-state per-cycle RMS error drops from 0.543 ± 0.00374 and 0.253 ± 0.353 to 0.158 ± 0.0148, corresponding to reductions of approximately 49.9% and 73.2% in P95, and 70.9% and 37.6% in per-cycle RMS, respectively. All the above values are expressed in units of 10⁻³ m.
As illustrated in Figure 10, robustness was further evaluated under a strong and sustained disturbance scenario, where a 300 N external force was applied continuously for 50 sampling cycles, starting at t = 1.5   s . Both controllers exhibited similar transient peak deviations, with the fixed-gain PID reaching 1.53 mm and the proposed PPO-RND-PPC-PID showing 1.55 mm, indicating comparable instantaneous disturbance responses. However, the proposed controller restored steady tracking within 0.018 s, whereas the fixed-gain PID required 0.163 s, representing a reduction of 89% in recovery time. These results confirm that the proposed method maintains equivalent transient accuracy while achieving significantly faster recovery and stronger disturbance rejection under high-magnitude external forces.
The observed improvements result from the synergy of PPC-based error constraints, PPO-driven parameter adaptation, and RND-induced exploration, which collectively contribute to faster convergence, quicker recovery from disturbances, and lower steady-state error. Furthermore, the PPO-RND inference occupies approximately 40–50% of the sampling period, and the overall control loop still operates comfortably within real-time limits, meeting the stringent timing requirements of high-precision motion control.

6. Conclusions

This paper proposes an adaptive β-PPO-RND online tuning framework under PPC constraints. Experimental validation on a PMLSM platform demonstrates its capability to significantly suppress steady-state error, reduce manual tuning cost, and maintain favorable dynamic performance and fundamental robustness. The method integrates PPC as a performance prior and leverages adaptive balancing between intrinsic and extrinsic rewards to drive efficient exploration, ensuring practical operability in engineering applications. Future research will be aimed at addressing the current framework’s limitations by improving the theoretical guarantees for stability and performance boundaries under the PPC framework, enhancing computational efficiency through pre-training and fine-tuning strategies, and exploring lightweight implementations with hardware acceleration to mitigate the computational burden of the RL inference process. These advancements aim to extend the framework’s applicability to broader semiconductor manufacturing scenarios.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W. and B.L.; software, Y.W.; validation, J.X. (Jingchong Xu); data curation, K.G.; writing—original draft, J.W.; writing—review and editing, J.X. (Jianping Xing); visualization, S.B.; supervision, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Jingchong Xu, Kaina Gao, Junjie Wang, Shi Bu, and Bin Liu were employed by the company China Electronics Technology Group Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kailath, T.; Schaper, C.; Cho, Y.; Gyugyi, P.; Norman, S.; Park, P.; Boyd, S.; Franklin, G.; Saraswat, K.; Moslehi, M.; et al. (Eds.) Control for advanced semiconductor device manufacturing: A case history. In Control System Applications; CRC Press: Boca Raton, FL, USA, 2019; pp. 67–83. [Google Scholar]
  2. Song, F.; Liu, Y.; Dong, Y.; Chen, X.; Tan, J. Motion control of wafer scanners in lithography systems: From setpoint generation to multistage coordination. IEEE Trans. Instrum. Meas. 2024, 73, 7508040. [Google Scholar] [CrossRef]
  3. Lopez-Sanchez, I.; Moreno-Valenzuela, J. PID control of quadrotor UAVs: A survey. Annu. Rev. Control 2023, 56, 100900. [Google Scholar] [CrossRef]
  4. Chai, T.; Zhou, Z.; Cheng, S.; Jia, Y.; Song, Y. Industrial metaverse-based intelligent PID optimal tuning system for complex industrial processes. IEEE Trans. Cybern. 2024, 54, 6458–6470. [Google Scholar] [CrossRef] [PubMed]
  5. Williams, M.E. Precision Six Degrees of Freedom Magnetically-Levitated Photolithography Stage. Master’s Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1998. [Google Scholar]
  6. Butler, H. Position control in lithographic equipment: An enabler for current-day chip manufacturing. IEEE Control Syst. 2011, 31, 28–47. [Google Scholar] [CrossRef]
  7. Zhang, L.; Zhang, P.; Jiang, B.; Yan, H. Research trends in methods for controlling macro-micro motion platforms. Nanotechnol. Precis. Eng. 2023, 6, 035001. [Google Scholar] [CrossRef]
  8. Liu, J.; Yu, H.; Wang, X. Fuzzy PID control of parallel 6-DOF motion platform. In Proceedings of the 2023 5th International Conference on Intelligent Control, Measurement and Signal Processing (ICMSP), Chengdu, China, 19–21 May 2023. [Google Scholar]
  9. Liu, H.; Zeng, Z.; Yang, X.; Zou, Y.; Liu, X. A control strategy for shipboard stabilization platforms based on a fuzzy adaptive proportional–integral–derivative (PID) control architecture. Mech. Sci. 2025, 16, 325–342. [Google Scholar] [CrossRef]
  10. Tian, H.; An, J.; Ma, H.; Tian, B. Trajectory planning and adaptive fuzzy PID control for precision in robotic vertebral plate cutting: Addressing force dynamics and deformation challenges. J. Intell. Robot. Syst. Theory Appl. 2024, 110, 161. [Google Scholar] [CrossRef]
  11. Noordin, A.; Mohd Basri, M.A.; Mohamed, Z. Adaptive PID control via sliding mode for position tracking of quadrotor MAV: Simulation and real-time experiment evaluation. Aerospace 2023, 10, 512. [Google Scholar] [CrossRef]
  12. Zhou, W.; Liu, H.; He, H. Research on genetic algorithm for feedforward parameter tuning in XY motion platform. In Proceedings of the 2023 24th International Conference on Electronic Packaging Technology (ICEPT), Shihezi, China, 8–11 August 2023. [Google Scholar]
  13. Aboud, H.; Amouri, A.; Cherfia, A.; Bouchelaghem, A.M. Fractional-order PID controller tuned by particle swarm optimization algorithm for a planar CDPR control. Indones. J. Electr. Eng. Comput. Sci. 2024, 33, 1500–1510. [Google Scholar] [CrossRef]
  14. Sabo, A.; Bawa, M.; Yakubu, Y.; Ngyarmunta, A.A.; Aliyu, Y.; Musa, A.; Katun, M. PID Controller Tuning for an AVR System Using Particle Swarm Optimisation Techniques and Genetic Algorithm Techniques: A Comparison-Based Approach. Vokasi Unesa Bull. Eng. Technol. Appl. Sci. 2025, 2, 270–280. [Google Scholar] [CrossRef]
  15. Ramadhani, N.; Ma’arif, A.; Çakan, A. Implementation of PID control for angular position control of dynamixel servo motor. Control Syst. Optim. Lett. 2024, 2, 8–14. [Google Scholar] [CrossRef]
  16. Li, D.; Wang, S.; Song, X.; Zheng, Z.; Tao, W.; Che, J. A BP-neural-network-based PID control algorithm of shipborne stewart platform for wave compensation. J. Mar. Sci. Eng. 2024, 12, 2160. [Google Scholar] [CrossRef]
  17. Ghasemi, M.; Rahimnejad, A.; Gil, M.; Akbari, E.; Gadsden, S.A. A self-competitive mutation strategy for differential evolution algorithms with applications to proportional–integral–derivative controllers and automatic voltage regulator systems. Decis. Anal. J. 2023, 7, 100205. [Google Scholar] [CrossRef]
  18. Liu, H.; Yu, Q.; Wu, Q. PID control model based on back propagation neural network optimized by adversarial learning-based grey wolf optimization. Appl. Sci. 2023, 13, 4767. [Google Scholar] [CrossRef]
  19. Yang, J.; Zhou, Q.; Xie, B. Application research of intelligent PID parameter optimization algorithm for XY linear motor motion platform in wire bonder. In Proceedings of the 2024 25th International Conference on Electronic Packaging Technology (ICEPT), Tianjin, China, 7–9 August 2024. [Google Scholar]
  20. Azeez, M.I.; Abdelhaleem, A.M.M.; Elnaggar, S.; Moustafa, K.A.F.; Atia, K.R. Optimization of PID trajectory tracking controller for a 3-DOF robotic manipulator using enhanced artificial bee colony algorithm. Sci. Rep. 2023, 13, 11164. [Google Scholar] [CrossRef]
  21. Suid, M.H.; Ahmad, M.A. Optimal Tuning of Sigmoid PID Controller Using Nonlinear Sine Cosine Algorithm for the Automatic Voltage Regulator System. ISA Trans. 2022, 128 Pt B, 265–286. [Google Scholar] [CrossRef]
  22. Tumari, M.Z.; Ahmad, M.A.; Suid, M.H.; Ghazali, M.R.; Tokhi, M.O. An Improved Marine Predators Algorithm Tuned Data-Driven Multiple-Node Hormone Regulation Neuroendocrine-PID Controller for Multi-Input–Multi-Output Gantry Crane System. J. Low Freq. Noise Vib. Act. Control 2023, 42, 1666–1698. [Google Scholar] [CrossRef]
  23. Li, L. A novel optimization control of gas turbine based on a hybrid method using the belbic and adaptive multi input multi output feedback control. J. Intell. Fuzzy Syst. 2023, 45, 863–876. [Google Scholar] [CrossRef]
  24. Basil, N.; Marhoon, H.M.; Mohammed, A.F. Evaluation of a 3-DOF helicopter dynamic control model using FOPID controller-based three optimization algorithms. Int. J. Inf. Technol. 2024, 16, 1–10. [Google Scholar] [CrossRef]
  25. Mok, R.H.; Ahmad, M.A. Fast and optimal tuning of fractional order PID controller for AVR system based on memorizable-smoothed functional algorithm. Eng. Sci. Technol. Int. J. 2022, 35, 101264. [Google Scholar] [CrossRef]
  26. Yonezawa, H.; Yonezawa, A.; Kajiwara, I. Experimental verification of active oscillation controller for vehicle drivetrain with backlash nonlinearity based on norm-limited SPSA. Proc. Inst. Mech. Eng. Part K J. Multi-Body Dyn. 2024, 238, 134–149. [Google Scholar] [CrossRef]
  27. Ma, D.; Chen, X.; Ma, W.; Zheng, H.; Qu, F. Neural network model-based reinforcement learning control for AUV 3-D path following. IEEE Trans. Intell. Veh. 2024, 9, 893–904. [Google Scholar] [CrossRef]
  28. Blais, M.A.; Akhloufi, M.A. Reinforcement learning for swarm robotics: An overview of applications, algorithms and simulators. Cogn. Robot. 2023, 3, 226–256. [Google Scholar] [CrossRef]
  29. Yan, L.; Liu, Z.; Chen, C.L.P.; Zhang, Y.; Wu, Z. Reinforcement learning based adaptive optimal control for constrained nonlinear system via a novel state-dependent transformation. ISA Trans. 2023, 133, 29–41. [Google Scholar] [CrossRef] [PubMed]
  30. Wang, Z.; Li, Y.; Ma, C.; Yan, X.; Jiang, D. Path-following optimal control of autonomous underwater vehicle based on deep reinforcement learning. Ocean Eng. 2023, 268, 113407. [Google Scholar] [CrossRef]
  31. Del Rio, A.; Jimenez, D.; Serrano, J. Comparative analysis of A3C and PPO algorithms in reinforcement learning: A survey on general environments. IEEE Access 2024, 12, 146795–146806. [Google Scholar] [CrossRef]
  32. Raeisi, M.; Sesay, A.B. Power control of 5G-connected vehicular network using PPO-based deep reinforcement learning algorithm. IEEE Access 2024, 12, 96387–96403. [Google Scholar] [CrossRef]
  33. Zhou, Z.; Mo, F.; Zhao, K.; Hao, Y.; Qian, Y. Adaptive PID control algorithm based on PPO. J. Syst. Simul. 2024, 36, 1425–1432. [Google Scholar] [CrossRef]
  34. Zhang, J.; Rivera, C.E.O.; Tyni, K.; Nguyen, S. AirPilot: Interpretable PPO-based DRL auto-tuned nonlinear PID drone controller for robust autonomous flights. arXiv 2024, arXiv:2404.00204. [Google Scholar] [CrossRef]
  35. Mikhaylova, E.; Makarov, I. Curiosity-driven exploration in VizDoom. In Proceedings of the 2022 IEEE 20th Jubilee International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia, 15–17 September 2022. [Google Scholar]
  36. Zhang, X.; Dai, R.; Chen, W.; Qiu, J. Self-supervised network distillation for exploration. Int. J. Pattern Recognit. Artif. Intell. 2023, 37, 2351021. [Google Scholar] [CrossRef]
  37. Zhang, G.; Xing, Y.; Zhang, W.; Li, J. Prescribed performance control for USV-UAV via a robust bounded compensating technique. IEEE Trans. Control Netw. Syst. 2025, in press. [Google Scholar] [CrossRef]
  38. Sui, S.; Tong, S. Finite-time fuzzy adaptive PPC for nonstrict-feedback nonlinear MIMO systems. IEEE Trans. Cybern. 2023, 53, 732–742. [Google Scholar] [CrossRef] [PubMed]
  39. Han, N.; Ren, X.; Zhang, C.; Zheng, D. Prescribed performance control with input indicator for robot system based on spectral normalized neural networks. Neurocomputing 2022, 492, 201–210. [Google Scholar] [CrossRef]
  40. Wang, Y.; Wang, J.; Gao, K.; Xing, J.; Liu, B. High-precision dynamic tracking control method based on parallel GRU–transformer prediction and nonlinear PD feedforward compensation fusion. Mathematics 2025, 13, 2759. [Google Scholar] [CrossRef]
  41. Bu, X. Prescribed performance control approaches, applications and challenges: A comprehensive survey. Asian J. Control 2023, 25, 241–261. [Google Scholar] [CrossRef]
Figure 1. Structure of the PMLSM used in this study.
Figure 2. Control architecture: PPO-RND with adaptive β optimizes PID online under PPC.
Figure 3. Actor–Critic network architecture.
Figure 4. Structure of the RND module.
Figure 5. Experimental setup of motion platform.
Figure 6. Overall signal flow of the proposed control system.
Figure 7. Training average reward curves of PPO-based methods. (a) PPO-PID vs. PPO-RND-PID. (b) PPO-PPC-PID vs. PPO-RND-PPC-PID.
Figure 8. Comparison of the tracked trajectory and reference path for the basic PID and proposed PPO-RND-PPC-PID controllers.
Figure 9. Closed-loop tracking error curves of different methods.
Figure 10. Disturbance response comparison between the PID and proposed PPO-RND-PPC-PID controllers.
Table 1. Experimental configuration.

Hardware | Parameter
Real-time simulation machine | CPU: Intel Core i9-9900KS @ 4.00 GHz octa-core (Intel Corporation, Santa Clara, CA, USA), 8 GB DDR3 SDRAM, operating system: RTOS
Analog output board | D/A conversion accuracy: 16-bit, voltage output: 16-channel, output range: ±10 V, conversion rate: 25 V/μs, setup time: 2 μs
FPGA board | FPGA: Kintex UltraScale XCKU040-2FFVA11561 (AMD, Santa Clara, CA, USA), Block RAM: 21.1 Mb, DSP slices: 1920
Sensor | Effective accuracy: 0.5 μm
Driver | DC: 240 V, AC: 100–240 V, IC: 6 A, IP: 18 A
Table 2. Comparative analysis of tracking performance among control algorithms.

Algorithm | MAE (×10⁻³ m) | RMSE (×10⁻³ m) | P95 (×10⁻³ m) | Per-Cycle RMS (SS) (×10⁻³ m)
PID | 0.461 | 0.512 | 1.687 | 0.543 ± 0.00374
PPO-PID | 0.434 | 0.482 | 1.763 | 0.522 ± 0.0105
PPO-RND-PID | 0.431 | 0.501 | 0.894 | 0.457 ± 0.00425
PPC-PID | 0.153 | 0.172 | 3.250 | 0.274 ± 0.046
PPO-PPC-PID | 0.151 | 0.170 | 3.151 | 0.253 ± 0.353
PPO-RND-PPC-PID | 0.135 | 0.154 | 0.844 | 0.158 ± 0.0148
Note: Bold indicates the best result.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
