Deep Reinforcement Learning for Variable Tension Control of Unmanned Underwater Vehicle Arresting Gear Under Nonlinear Effects

Wang, Xikun; Li, Weijia; Huang, Junlei; Liu, Fayou

doi:10.3390/machines14060654

¹ School of Naval Architecture & Ocean Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

² Hubei Provincial Key Laboratory of Hydrodynamics, Wuhan 430074, China

³ Wuhan Institute of Shipbuilding Technology, Wuhan 430050, China

* Author to whom correspondence should be addressed.

Machines 2026, 14(6), 654; https://doi.org/10.3390/machines14060654

Submission received: 6 May 2026 / Revised: 31 May 2026 / Accepted: 1 June 2026 / Published: 4 June 2026

(This article belongs to the Section Automation and Control Systems)

Abstract

Large Unmanned Underwater Vehicles (UUVs) are playing an increasingly critical role in complex marine missions due to their enhanced payload and endurance capabilities. However, the safe recovery of these platforms remains a significant challenge, complicated by their high inertia, strong hydrodynamic interactions, and unpredictable environmental disturbances. In particular, the nonlinear coupling effects between the mechanical structure and the hydrodynamic environment exert a considerable influence on the system, accounting for nearly 50% of the tension on the arresting cable. To address these challenges, this paper proposes a variable tension control strategy for a UUV underwater arresting recovery system, utilizing a Well-Shaped Reward Entropy-regularized Proximal Policy Optimization (WSR-E-PPO) algorithm. In this framework, the real-time velocity and displacement of the UUV are utilized to represent the spatiotemporal characteristics of the recovery state, and a hybrid reward function integrating sparse and continuous rewards based on Potential-Based Reward Shaping (PBRS) is designed. Simulation results demonstrate that the proposed method enables the UUV to return to the docking point without oscillation, while effectively limiting the total recovery time to approximately 80 s—a 37.5% reduction compared with existing methods. Furthermore, the strategy ensures smoother tension regulation throughout the process. These findings provide a solid technical foundation and assurance for the stable and safe underwater recovery of large UUVs.

Keywords:

Unmanned Underwater Vehicle recovery; arresting cable; variable tension control strategy; proximal policy optimization method

1. Introduction

With the growing emphasis on marine resource exploration and utilization, Unmanned Underwater Vehicles (UUVs) have emerged as indispensable tools in the field of ocean engineering. Their high maneuverability, autonomy, and cost-effectiveness make them particularly suitable for performing a broad range of underwater operations, such as environmental monitoring, seabed mapping, infrastructure inspection, and salvage missions [1,2,3,4,5]. The recovery capability of a UUV is a key technological foundation for enabling in situ energy replenishment and secure data transfer. Critically, efficient recovery ensures sustained operational endurance and tactical stealth, which are essential for reliable long-term deployment in hostile environments. However, the intricate and dynamic nature of the underwater environment continues to impose significant technical challenges on the design and implementation of UUV recovery systems [6,7]. Consequently, the advancement of robust and efficient recovery technologies for UUVs has emerged as a focal point of current research and engineering efforts [8,9].

In consideration of the varying types of UUVs and the complexity of different marine environments, researchers have designed and developed a series of UUV recovery systems. Each system incorporates a uniquely configured docking mechanism, optimized to meet the functional and environmental demands of its intended application [10]. Currently, two main methods are commonly adopted for UUV recovery: surface mother-ship recovery and underwater recovery [11,12,13]. The implementation of underwater recovery platforms offers significant advantages, including the provision of multiple energy-replenishment modes and the facilitation of real-time or near-real-time data transfer, while effectively reducing both operational costs and application risks [14,15]. A diverse range of recovery-docking platforms has been developed. Most are configured as fixed installations on the seabed or as integral components of seabed observation stations. Other designs feature floating structures moored to the seabed, platforms towed by surface vessels, or systems integrated with larger submarines or UUVs, thereby meeting the recovery demands imposed by heterogeneous and complex marine environments [16,17]. Surface mother-ship recovery, by contrast, mainly uses equipment such as cranes to lift UUVs located in the target area [18].

As UUVs continue to increase in size and mass, conventional recovery equipment has become insufficient to withstand the considerable impact loads induced by their large inertia. Consequently, these limitations manifest as pronounced deficiencies in the efficiency, stability, and reliability of existing UUV recovery operations. An arresting gear system has been proposed to address the problems of large inertia and dynamic underwater recovery [19]. During the recovery process, the nonlinear behavior of the recovery equipment, the complexity of the underwater environment, and the uncertainty associated with the UUV’s dynamic recovery posture, combined with the vehicle’s own state after engagement with the arresting mechanism, make it difficult for a single tension-control strategy for the arresting cable to achieve robust adaptability. These factors collectively constrain the overall recovery performance and operational efficiency of the system, often preventing the UUV from accurately returning to the designated docking position. In particular, the frictional interaction between the arresting cable and the arresting components exerts a pronounced influence on the system’s control characteristics, introducing significant variability in the vehicle’s trajectory toward the docking interface. As a result, the recovery efficiency is substantially diminished. As the requirements for positioning precision and trajectory tracking accuracy in motion control systems continue to intensify, the detrimental effects of friction on system dynamics have garnered significant scholarly and industrial interest. Against this backdrop, friction compensation control has evolved into a key research frontier within the global control engineering community. Investigations in this area primarily focus on two core aspects: accurate estimation of friction model parameters and the development of effective compensation strategies. Accordingly, existing methodologies can be systematically categorized into non-model-based compensation schemes and model-based compensation schemes, each with distinct theoretical foundations and practical implementations [20,21,22].

In advanced motion control, systems characterized by a low force-to-friction ratio, wherein nonlinear frictional forces are comparable in magnitude to the active control inputs, highlight the fundamental limitations of conventional control paradigms. In heavy-duty rigid systems, such as industrial hydraulics, structural friction can consume up to 30% to 50% of the maximum actuator torque [23]. Traditional model-based feedforward strategies (e.g., LuGre models) are often employed to counteract this. However, these methods are highly sensitive to parameter variations. In unpredictable environments, dynamic factors like temperature shifts and component wear inevitably cause severe model mismatches, causing feedforward loops to trigger limit-cycle oscillations. Furthermore, in flexible cable-driven systems, the Capstan effect exponentially amplifies friction across routing structures, leading to severe tension loss [24]. While analytical models can estimate this tension propagation in static geometries, they fail entirely during unconstrained operations like UUV retrieval. Unpredictable underwater hydrodynamics cause instantaneous fluctuations in cable contact angles and normal forces, rendering static analytical models mathematically intractable for capturing real-time transient friction boundaries. To address nonlinear friction without precise models, intelligent control strategies have been explored, yet they face severe limitations under extreme boundaries. Early adaptive neural networks [25] operate under a “high force-to-friction ratio” assumption, treating friction as a minor perturbation. When applied to the violent Stribeck stick-slip transitions of heavy underwater recovery, their local adaptation mechanisms fail to converge rapidly, causing severe transient oscillations. Furthermore, while modern Deep Reinforcement Learning (DRL) [26,27] bypasses explicit modeling, standard agents expose critical vulnerabilities in friction-dominated environments. Hampered by naive noise-based exploration and sparse terminal rewards, they frequently suffer from exploration inefficiency and become trapped in friction-induced dead zones or suboptimal local minima.

Deep Reinforcement Learning (DRL), a rapidly evolving paradigm within artificial intelligence [28], synergistically combines the feature-extraction and perceptual strengths of Deep Learning (DL) with the sequential decision-making capabilities of Reinforcement Learning (RL). This integration provides a promising approach to solving the coupled perception and control problems inherent in complex dynamic systems [29,30]. In recent years, DRL has become a vibrant research frontier and has demonstrated remarkable applicability in diverse fields, ranging from robotics and autonomous navigation to energy management and industrial process optimization [31,32,33,34,35].

Building upon insights from existing studies, this paper investigates a tension control strategy for the UUV underwater arresting cable recovery system based on the Proximal Policy Optimization (PPO) algorithm. First, a dynamic model of the arresting cable system and the UUV is developed. Second, a nonlinear friction model is formulated based on the Stribeck effect. Finally, the exploratory mechanism of entropy is incorporated into the PPO framework, and a tailored reward function is designed to enable the intelligent agent to effectively mitigate nonlinear disturbances predominantly caused by friction. The main contributions of this work are as follows:

The dynamic models of the arresting cable system and the UUV were systematically established, while a nonlinear friction model derived from the Stribeck effect was constructed and subsequently embedded within the deep reinforcement learning environment to facilitate training and control strategy development.
In this study, the Proximal Policy Optimization (PPO) algorithm, grounded in the Actor–Critic framework, is employed as the core learning paradigm. Within this framework, generalized advantage estimation is combined with a clipping mechanism to effectively regulate the magnitude of gradient updates, thereby ensuring training stability and improving convergence properties. In alignment with the operational objectives of UUV recovery, a carefully designed reward structure that integrates both dense rewards and sparse terminal rewards is proposed to provide nuanced guidance throughout the learning process. Moreover, an entropy regularization term is incorporated into the objective function to promote sufficient exploration of the policy space and to alleviate the tendency of the learning agent to converge prematurely to suboptimal local optima.
To rigorously assess the effectiveness of the proposed WSR-E-PPO approach, a series of high-fidelity simulation studies were performed, in which its performance was systematically compared against five representative control strategies during the UUV retrieval phase—namely, the operational stage in which the vehicle is guided from its maximum arresting distance back to the designated docking interface. The comparative simulation results indicate that the WSR-E-PPO algorithm achieves superior performance, enabling the UUV to return to the docking point and come to a stable halt within a minimal time of 82 s. Moreover, the algorithm ensures a consistently smooth velocity trajectory and exhibits well-moderated tension variations in the arresting cable, thereby demonstrating both enhanced dynamic stability and improved recovery efficiency.
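
The clipping and entropy-regularization ideas in the second contribution can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the authors' implementation: the function names `gae` and `ppo_entropy_loss`, the discount factor, the GAE parameter, the clipping range, and the entropy coefficient are all illustrative assumptions.

```python
import math


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards[t] and values[t] belong to step t; last_value bootstraps the
    value of the state reached after the final step (0.0 if terminal).
    """
    values_ext = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_entropy_loss(logp_new, logp_old, advantages, entropies,
                     clip_eps=0.2, ent_coef=0.01):
    """Negative clipped surrogate objective with an entropy bonus.

    All hyperparameter defaults are assumed values for illustration.
    """
    n = len(advantages)
    surrogate = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between the unclipped and clipped terms.
        surrogate += min(ratio * adv, clipped_ratio * adv)
    mean_entropy = sum(entropies) / n
    # Minimizing this loss maximizes the surrogate plus the entropy bonus.
    return -(surrogate / n + ent_coef * mean_entropy)
```

When the old and new log-probabilities coincide, the ratio is 1 and only the entropy bonus changes the loss. Once the ratio leaves the interval [1 − ε, 1 + ε] in the direction favored by the advantage, the clipped branch caps the incentive to move the policy further.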

The remainder of this paper is structured as follows. Section 2 offers a comprehensive review of the existing literature, synthesizing prior work and highlighting the research gaps that motivate this study. Section 3 elaborates on the operational principles of the arresting-cable system and formulates the associated dynamic models that underpin the proposed control framework. Section 4 introduces the WSR-E-PPO methodology in detail, encompassing the algorithmic architecture, state and action representations, and the design rationale for the reward mechanism. Section 5 presents extensive simulations, together with a thorough evaluation and in-depth discussion of the observed results. Finally, Section 6 provides concluding remarks and outlines potential directions for future research.

2. Related Work

A substantial body of research has focused on the analysis and mitigation of nonlinear disturbances in control systems, with friction being one of the most critical and widely studied factors. To systematically review the existing literature and position the contributions of this study, this section is divided into conventional friction compensation methodologies and recent advancements in deep reinforcement learning (DRL) applied to mechanical systems.

2.1. Friction Compensation Methods

Over the past decades, scholars have developed two overarching classes of friction compensation methodologies: non-model-based and model-based schemes. Within the non-model-based category, friction is typically regarded as an unstructured external disturbance. Control performance is improved by adaptively adjusting or reconfiguring control parameters, thereby enhancing the system’s disturbance rejection capability without relying on explicit friction models. This line of research has provided a foundation for robust control techniques that prioritize implementation simplicity and resilience under uncertain operating conditions [36]. A variety of advanced methodologies have been proposed to address friction-induced nonlinearities, including disturbance-observer-based schemes [37,38,39], neural-network-driven controllers [40,41,42], and nonlinear PID strategies [43]. For instance, Amthor et al. [44] developed a nonlinear state observer that enabled effective friction compensation in ultra-high-precision motion control at the nanometer scale. Furthermore, Lin and Li [45] introduced a fuzzy-neural-network-based control approach, which demonstrated a remarkable reduction in tracking errors of linear piezoelectric motors when compared with adaptive sliding mode control. In contrast to non-model-based schemes, model-based friction compensation methods are distinguished by their strong specificity and direct utilization of frictional dynamics [46]. Classical static friction models such as the Coulomb, Stribeck, and Karnopp formulations are widely adopted to characterize frictional behavior under low- and high-speed conditions. To capture more intricate dynamics, a range of dynamic friction models has been proposed, including the LuGre [47], Leuven [48], Hsieh [49], and GMS models [50], which extend static descriptions by incorporating pre-slip dynamics and memory effects. Representative studies have demonstrated the effectiveness of such approaches: for instance, Kabziński and Jastrzębski [51] established a LuGre-based friction model and coupled it with an adaptive control strategy, yielding significant reductions in tracking error relative to controllers without explicit friction compensation. Likewise, Zschack et al. [52] introduced an adaptive feedback mechanism founded on the GMS model, achieving nanometer-level precision in motion control. These representative studies underscore the diversity of existing friction compensation techniques and highlight the ongoing efforts to enhance accuracy and robustness in high-precision control systems. Conventional control strategies that rely heavily on high-fidelity system models often exhibit substantial performance degradation when confronted with friction-induced disturbances, making it difficult to simultaneously ensure high precision and robustness. To alleviate this issue, non-model-based friction compensation schemes have been widely explored; these methods typically rely on empirically tuned rules or fixed gain structures. While they offer the advantage of not requiring explicit friction models, their capacity to capture the complex nonlinear and coupled dynamics of friction remains inherently limited.

2.2. Deep Reinforcement Learning in Friction Compensation

Recent advances in deep reinforcement learning (DRL) have opened up new avenues for friction compensation. By leveraging its model-free nature, universal function-approximation capability through deep neural architectures, and adaptive policy improvement via continual interaction with the environment, DRL demonstrates a superior ability to accommodate the complex, time-varying, and highly uncertain dynamics induced by friction. This paradigm shift enables the design of control strategies that are not only robust to severe nonlinear disturbances but also capable of approaching globally optimal performance, thereby representing a promising direction for future high-precision control applications. In the current research, DRL methodologies are generally categorized into two principal paradigms: value-function-based algorithms and policy-based algorithms. The former focuses on estimating the expected return of states or state–action pairs to derive optimal decision policies, while the latter directly optimizes the policy itself, often enabling more flexible handling of high-dimensional or continuous action spaces. This fundamental dichotomy provides the theoretical basis for a wide range of subsequent algorithmic developments and hybrid frameworks in the DRL domain [53,54]. A seminal contribution in this area was made by Mnih et al. [28], who introduced the Deep Q-Network (DQN), a framework that synergistically combines deep neural networks with classical Q-learning. By leveraging convolutional neural networks to approximate value functions, the DQN significantly enhanced the capability of reinforcement learning to handle high-dimensional state spaces. To address the strong correlation among sequentially sampled transitions, the DQN incorporates two critical mechanisms: an experience replay buffer that randomizes training samples, and a target network that stabilizes the learning process by providing slowly updated target values [55]. Building upon this foundation, Van Hasselt et al. [56] proposed the Double Deep Q-Network (Double DQN), which mitigates the overestimation bias inherent in standard DQN training. In this approach, the next action is determined via a greedy selection from the online Q-network, while its value is evaluated through the target network, thereby decoupling action selection from action evaluation. In addition to value-function-based methods, DRL also encompasses policy-based approaches, whose core idea is to employ deep neural networks to represent the policy in reinforcement learning. The Deep Deterministic Policy Gradient (DDPG) algorithm integrates the deterministic policy gradient framework with the Actor–Critic structure, while incorporating the target network and experience replay mechanisms originally introduced in DQN. In DDPG, the actor network represents the policy, whereas the critic network approximates the value function [57]. Popov et al. [58] proposed an improved asynchronous DPG algorithm with configurable update frequencies, enhancing the utilization of experience samples. Schulman et al. [59] developed the Trust Region Policy Optimization (TRPO) algorithm, which constrains policy parameter updates to ensure that the expected return improves monotonically. Subsequently, Schulman et al. [60] simplified TRPO and introduced Proximal Policy Optimization (PPO), which employs stochastic gradient ascent and multi-step updates to achieve stable policy improvement. In parallel, the Soft Actor–Critic (SAC) [61] algorithm was proposed, which adopts the maximum-entropy principle by jointly maximizing expected returns and policy entropy, thereby enhancing exploration and robustness. Collectively, these policy-based approaches have substantially advanced the state of the art in DRL, offering powerful solutions for high-dimensional, continuous action control problems and laying a solid foundation for subsequent algorithmic innovations.
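
The difference between the standard DQN target and the Double DQN target described above can be made concrete with a short sketch. The function names, the list-based Q-value representation, and the discount factor are assumptions chosen for illustration; no particular library API is implied.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    # Greedy action according to the online network.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Value of that action according to the target network.
    return reward + gamma * q_target_next[best_action]
```

In this toy setting, whenever the two networks disagree about the best next action, the Double DQN target is no larger than the standard target, which is the mechanism by which overestimation is reduced.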

In the broader context of mechanical system control, DRL has increasingly been adopted to address trajectory tracking and motion control tasks hampered by unmodeled dynamics and complex mechanical constraints. For instance, recent work by Wang et al. [62] proposed a DRL-based tracking controller that integrates intrinsic exploration mechanisms to manage input dead zones and dynamic uncertainties in robotic manipulators. Similarly, Xu et al. [63] utilized hierarchical DRL architectures to govern the highly coupled kinematics of a 12-degree-of-freedom mobile robot, maintaining robust tracking even under severe wheel slippage. Furthermore, Pavlichenko [64] demonstrated the efficacy of DRL in correcting reference trajectories for flexible-joint manipulators directly on real-world hardware, effectively compensating for mechanical flexibilities. These studies collectively highlight the capability of DRL to bypass precise mathematical modeling and adaptively suppress generalized mechanical nonlinearities.

Within the domain of friction compensation, DRL has emerged as a promising framework for managing the uncertainty, strong nonlinearity, and time-varying behavior inherent in friction-affected systems, offering distinct advantages in complex and dynamically changing environments. For instance, Johannink et al. [65] introduced a residual reinforcement learning strategy that augments a conventional feedback controller with a DRL agent, combining their output signals to achieve robust friction compensation while retaining the stability properties of the baseline controller. In another representative study, Al-Mahasneh et al. [27] proposed an online, model-free DRL control method capable of adapting to substantial variations in friction coefficients; their approach sustained high-quality control performance even under real-time friction increases of 100–200%.

Despite these initial successes, a critical gap remains in this cross-disciplinary field: the application of DRL to extreme “low force-friction ratio” regimes. Current DRL applications largely operate in environments where active driving forces significantly outweigh frictional resistance. When confronted with the heavy-duty operational boundaries of underwater recovery, where the nonlinear Stribeck friction from mechanical structures can practically overpower the primary driving force, standard DRL agents expose critical vulnerabilities. Hampered by naive noise-based exploration and sparse terminal rewards, conventional DRL agents suffer from severe exploration inefficiency and become trapped in friction-induced dead zones.

Consequently, overcoming the exploration paralysis and control instability under these extreme high-friction boundaries constitutes the core challenge addressed in this study. This need directly motivates the development of our tailored WSR-E-PPO architecture, which is specifically designed to bridge this cross-disciplinary gap by ensuring sustained policy exploration and robust recovery under dominant nonlinear friction.

3. Preliminaries

3.1. Principle and Modeling of Arresting Gear System

The operational workflow and structural configuration of the arresting gear system are illustrated in Figure 1. A Cartesian coordinate system is established with its origin located at the midpoint between the two fixed attachment points on the mother ship. The X-axis is defined parallel to the longitudinal axis of the mother ship, while the Y-axis extends vertically upward, perpendicular to the deck plane. The recovery hook is mounted on the ventral side (underside) of the UUV, situated near the bow section. In the pre-deployment phase (Figure 1a), the UUV initiates the recovery sequence and approaches the mother ship with an initial relative velocity

v_{0}

. Subsequently, in the initial deployment phase (Figure 1b), the hook mounted on the UUV’s ventral side comes into contact with the arresting cable. As shown in Figure 1c, once the hook successfully engages, an arresting force is exerted through the cable, decelerating the UUV until its relative velocity is reduced to zero. Finally, in the retrieval phase (Figure 1d), the stationary UUV is towed along the cable trajectory to the docking point, ensuring a precise zero-velocity arrival.

To precisely define the scope of the control task in this study, it is essential to clarify that our research specifically targets the post-arrest retrieval process. Under this operational scenario, it is assumed that the UUV has already been successfully intercepted by the arresting gear and brought to a complete halt at its maximum run-out distance. Consequently, the initial state of the control environment is defined as a strictly stationary position, with an initial displacement of

x_{0} = 2.5

m and an initial velocity of

v_{0} = 0

m/s. The primary objective of the control system is to actively regulate the hydraulic arresting tension to smoothly and safely tow the UUV from a distant point back to the home docking origin (

x_{t} = 0

m), with a successful recovery defined by a final position tolerance of ±0.1 m and a velocity tolerance of ±0.01 m/s.
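
The initial state and success criterion stated above translate directly into a terminal check. The sketch below assumes displacement in metres and velocity in metres per second; the function and constant names are hypothetical.

```python
# Values taken from the task definition in the text.
X_INITIAL = 2.5      # m, maximum run-out distance at the start of retrieval
V_INITIAL = 0.0      # m/s, UUV is stationary when retrieval begins
X_TARGET = 0.0       # m, home docking origin
POS_TOL = 0.1        # m, final position tolerance
VEL_TOL = 0.01       # m/s, final velocity tolerance


def is_recovered(x, v, x_target=X_TARGET, pos_tol=POS_TOL, vel_tol=VEL_TOL):
    """Return True when both the position and velocity tolerances are met."""
    return abs(x - x_target) <= pos_tol and abs(v) <= vel_tol
```

Both conditions must hold at the same time: a UUV that passes through the docking point while still moving faster than the velocity tolerance does not count as recovered.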

The subsequent analysis is based on the following set of assumptions and premises:

In the present study, only the case where the UUV impacts the arresting cable along the X-direction is considered. Based on extensive preliminary land-based tests, the influence of the deflection angle on the final test results has been found to be negligible.
The present analysis is confined to the translational motion and corresponding forces along the X-axis.
Simulation results from prior studies indicate that, irrespective of the initial drift angle or lateral deviation during engagement with the arresting cable, the UUV ultimately slides to the central point of the cable upon deceleration.

The kinematic and dynamic model of the arresting gear system is illustrated in Figure 2. The UUV initially approaches the mother ship with a specific velocity. Following the engagement of the ventral hook with the arresting cable, the UUV continues its forward motion along the X-axis due to inertia until reaching maximum displacement. Subsequently, the tension generated in the arresting cable draws the UUV backward. This dynamic process ensures the vehicle’s convergence to the predetermined recovery state (i.e., the target position and zero velocity), thereby marking the successful completion of the recovery operation.

Based on the force analysis of the UUV during the retrieval phase, the arresting angle and the corresponding arresting cable tension can be determined as follows:

θ = \arctan \frac{L}{S}

(1)

F_{arrest} = 2 F_{L} \cos θ

(2)

where

θ

is the angle between the direction of the UUV’s motion and the arresting cable, L is half the distance between pulleys A and B, S is the displacement of the UUV,

F_{arrest}

is the arresting force and

F_{L}

is the tension of the arresting cable. The forces acting on the UUV along the X-axis can be modeled as:

F_{P} - R - F_{arrest} = m a

(3)

where

F_{P}

is the propulsion force produced by the UUV itself, R is the hydrodynamic resistance, m is the mass of the UUV, and a is its acceleration along the X-axis.
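
Equations (1)–(3) can be evaluated numerically as in the following sketch. The function names and any numerical values passed to them are illustrative assumptions. `math.atan2` is used so that the angle stays well defined when the displacement S is zero.

```python
import math


def arresting_angle(half_span, displacement):
    """Eq. (1): theta = arctan(L / S), returned in radians."""
    return math.atan2(half_span, displacement)


def arresting_force(cable_tension, half_span, displacement):
    """Eq. (2): F_arrest = 2 * F_L * cos(theta)."""
    theta = arresting_angle(half_span, displacement)
    return 2.0 * cable_tension * math.cos(theta)


def uuv_acceleration(propulsion, resistance, f_arrest, mass):
    """Eq. (3) solved for a: a = (F_P - R - F_arrest) / m."""
    return (propulsion - resistance - f_arrest) / mass
```

Because cos θ = S / √(L² + S²), the arresting force for a given cable tension vanishes as the UUV approaches the line between the pulleys (S → 0). The final approach to the docking point is therefore sensitive to small tension errors.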

An overview of the system’s mechanical architecture is presented in Figure 3. The system features a bilaterally symmetrical design where the arresting cable is routed from the internal drive unit, through a series of guide pulleys, to the bell-shaped outlet at the distal end of the mechanical arm. The primary engineering function of this bell-shaped outlet is to serve as an omnidirectional fairlead, ensuring the cable is smoothly guided into the recovery mechanism even when the underwater vehicle engages at varying off-axis angles. However, this specialized structure introduces severe nonlinear disturbances into the system dynamics. When the UUV’s hook engages the cable, the deployed segment of the cable is subjected to dynamic tension and spatial displacements along the X-axis. This forces the cable tightly against the curved inner wall of the bell-shaped outlet. The physical mechanism of the resulting friction is highly complex: as the contact angle, cable velocity, and instantaneous tension continuously fluctuate due to the vehicle’s motion in the fluid environment, the cable experiences severe, time-varying frictional resistance at this contact boundary, often exhibiting pronounced nonlinear characteristics such as the Stribeck effect. Consequently, this frictional interaction acts as a decisive bottleneck in the system dynamics. The core control difficulty lies in the fact that this intense, localized friction creates a significant and unpredictable disparity between the control torque applied by the internal drive system and the actual arresting tension experienced by the UUV at the physical exit point of the cable. Overcoming this massive, unmodeled frictional disturbance at the distal boundary of the mechanical arm is what makes precise tension regulation exceptionally challenging, necessitating highly adaptive control strategies.

The drive system, illustrated in Figure 4, utilizes a valve-controlled hydraulic motor architecture. This system generates tension on the arresting cable through the coupled operation of the hydraulic motor and the winch. The mathematical model of the system is established below. First, the relationship between the servo valve’s spool displacement and the input current can be expressed as:

τ_{v} \frac{d x_{v}}{d t} + x_{v} = k_{v} i

(4)

where

τ_{v}

is the time constant of the valve,

x_{v}

is the displacement of the valve spool,

k_{v}

is the servo-valve gain, and i is the input current. The hydraulic load flow is given as follows:

Q_{L} = C_{d} ω x_{v} \sqrt{\frac{1}{ρ} (P_{s} - sgn (x_{v}) P_{L})}

(5)

where

C_{d}

is the flow coefficient,

ω

is the gradient of the valve orifice area,

ρ

is the hydraulic fluid density,

P_{s}

is the supply pressure, and

P_{L} = P_{1} - P_{2}

is the pressure drop across the load. To facilitate analysis and control design, the nonlinear model is approximated by a linear model:

Q_{L} = K_{q} x_{v} - K_{c} P_{L}

(6)

where

K_{q} = \frac{\partial Q_{L}}{\partial x_{v}}

is the flow gain, and

K_{c} = - \frac{\partial Q_{L}}{\partial P_{L}}

is the flow-pressure coefficient.
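
The linearization from Equation (5) to Equation (6) can be checked numerically. The sketch below assumes a positive spool displacement at the operating point and takes K_c as the positive magnitude of ∂Q_L/∂P_L, so that the minus sign in Equation (6) holds. The function names and the operating-point values used with them are hypothetical.

```python
import math


def load_flow(x_v, p_load, c_d, w, rho, p_s):
    """Eq. (5): nonlinear orifice flow through the servo valve."""
    sign = (x_v > 0) - (x_v < 0)
    return c_d * w * x_v * math.sqrt((p_s - sign * p_load) / rho)


def flow_gains(x_v0, p_load0, c_d, w, rho, p_s):
    """Analytic gains of Eq. (6) at the operating point (x_v0 > 0, p_load0).

    K_q = dQ_L/dx_v and K_c = -dQ_L/dP_L, both evaluated at the operating point.
    """
    k_q = c_d * w * math.sqrt((p_s - p_load0) / rho)
    k_c = c_d * w * x_v0 / (2.0 * math.sqrt(rho * (p_s - p_load0)))
    return k_q, k_c


def linear_flow(x_v, p_load, x_v0, p_load0, c_d, w, rho, p_s):
    """First-order approximation of Eq. (5) around the operating point."""
    k_q, k_c = flow_gains(x_v0, p_load0, c_d, w, rho, p_s)
    q0 = load_flow(x_v0, p_load0, c_d, w, rho, p_s)
    return q0 + k_q * (x_v - x_v0) - k_c * (p_load - p_load0)
```

The flow gain grows with the available pressure drop, while the flow-pressure coefficient grows with spool opening. A single pair of gains is therefore only accurate near the operating point at which it was computed.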

For the hydraulic motor, the flow continuity equations are given as follows:

\begin{matrix} Q_{1} & = D_{m} \frac{d θ_{m}}{d t} + C_{i m} (P_{1} - P_{2}) + C_{e m} P_{1} + \frac{V_{1}}{β_{e}} \frac{d P_{1}}{d t} \end{matrix}

(7)

\begin{matrix} Q_{2} & = D_{m} \frac{d θ_{m}}{d t} + C_{i m} (P_{1} - P_{2}) - C_{e m} P_{2} - \frac{V_{2}}{β_{e}} \frac{d P_{2}}{d t} \end{matrix}

(8)

\begin{matrix} Q_{L} & = \frac{Q_{1} + Q_{2}}{2} = D_{m} {\dot{θ}}_{m} + C_{t m} P_{L} + \frac{V_{t}}{4 β_{e}} \frac{d P_{L}}{d t} \end{matrix}

(9)

where

Q_{1}

and

Q_{2}

are the flow rates into the high-pressure chamber and out of the low-pressure chamber, respectively,

D_{m}

is the motor displacement,

C_{i m}

and

C_{e m}

are the internal and external leakage coefficients,

C_{t m} = C_{i m} + \frac{C_{e m}}{2}

is the total leakage coefficient,

V_{t} = V_{1} + V_{2}

is the total control volume, and

β_{e}

is the effective bulk modulus of the hydraulic fluid. And the output torque generated by the hydraulic motor is given as follows:

T_{m} = D_{m} P_{L} η_{m} = J_{t} \frac{d^{2} θ_{m}}{d t^{2}} + B_{m} \frac{d θ_{m}}{d t} + G θ_{m} + T_{d}

(10)

where

T_{m}

is the effective output torque produced by the hydraulic motor,

η_{m}

is the mechanical efficiency,

θ_{m}

is the angular displacement of the hydraulic motor,

J_{t}

is the total rotational inertia,

B_{m}

is the viscous damping coefficient, G is the torsional stiffness of the load, and

T_{d}

is the external disturbance torque.

Taking the valve spool displacement x_{v} as the control input u, the state-space model of the hydraulic system is formulated as follows:

x = [\begin{matrix} x_{1} \\ x_{2} \\ x_{3} \end{matrix}] = [\begin{matrix} θ_{m} \\ {\dot{θ}}_{m} \\ P_{L} \end{matrix}]

(11)

\{\begin{matrix} {\dot{x}}_{1} = & x_{2} \\ {\dot{x}}_{2} = & \frac{1}{J_{t}} (D_{m} x_{3} η_{m} - B_{m} x_{2} - G x_{1} - T_{d}) \\ {\dot{x}}_{3} = & \frac{4 β_{e}}{V_{t}} (K_{q} u - K_{c} x_{3} - D_{m} x_{2} - C_{t m} x_{3}) \end{matrix}

(12)
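As a rough illustration of how the state-space model can be evaluated in simulation, the sketch below computes the state derivatives and advances the state with one explicit Euler step. It assumes that the leakage term uses the total leakage coefficient from Equation (9) and that the control input is the spool displacement. The class name, function names, and the choice of integrator are hypothetical, and no parameter values from the paper are implied.

```python
from dataclasses import dataclass


@dataclass
class HydraulicParams:
    j_t: float     # total rotational inertia J_t
    b_m: float     # viscous damping coefficient B_m
    g: float       # torsional stiffness of the load G
    d_m: float     # motor displacement D_m
    eta_m: float   # mechanical efficiency
    beta_e: float  # effective bulk modulus
    v_t: float     # total control volume V_t
    k_q: float     # flow gain K_q
    k_c: float     # flow-pressure coefficient K_c
    c_tm: float    # total leakage coefficient C_tm


def state_derivative(x, u, t_d, p):
    """Right-hand side of the state-space model for x = (theta_m, omega_m, P_L)."""
    theta, omega, p_l = x
    d_theta = omega
    d_omega = (p.d_m * p_l * p.eta_m - p.b_m * omega - p.g * theta - t_d) / p.j_t
    d_p_l = (4.0 * p.beta_e / p.v_t) * (
        p.k_q * u - p.k_c * p_l - p.d_m * omega - p.c_tm * p_l
    )
    return (d_theta, d_omega, d_p_l)


def euler_step(x, u, t_d, p, dt):
    """Advance the state by one explicit Euler step of size dt (assumed integrator)."""
    dx = state_derivative(x, u, t_d, p)
    return tuple(xi + dt * dxi for xi, dxi in zip(x, dx))
```

Because the pressure equation is scaled by the bulk modulus over the control volume, it is typically much stiffer than the mechanical equations, so a small step size or an implicit integrator may be needed in practice.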

3.2. Stribeck-Based Friction Modeling of the Arresting System

As shown in Figure 5a, the arresting cable is wound around the pulley, where

T_{A}

and

T_{B}

represent the tensions on the slack and tight sides of the cable, respectively. The wrap angle of the cable on the pulley is specified as

θ

.

An infinitesimal element of the cable in the contact region is analyzed, as shown in Figure 5b. Assuming the cable is massless, the force components along the horizontal and vertical directions are resolved, leading to the following expressions for static force equilibrium:

\begin{matrix} T_{θ} cos \frac{d θ}{2} + μ d N = (T_{θ} + d T_{θ}) cos \frac{d θ}{2} \end{matrix}

(13)

\begin{matrix} T_{θ} sin \frac{d θ}{2} + (T_{θ} + d T_{θ}) sin \frac{d θ}{2} = d N \end{matrix}

(14)

where

d θ

is the infinitesimal contact angle,

d N

is the contact pressure force between the cable and the pulley,

T_{θ}

is tension in the cable at the angular position

θ

, and

μ

is the coefficient of friction at the cable–pulley interface.

The

d θ

considered herein is an infinitesimal quantity, so the small-angle approximations apply:

sin \frac{d θ}{2} \approx \frac{d θ}{2}

,

cos \frac{d θ}{2} \approx 1

, while the second-order infinitesimal term

d T_{θ} sin \frac{d θ}{2}

is neglected. By simplifying the two equations above, the resulting expression is derived as follows:

\begin{matrix} μ d N & = d T_{θ} \end{matrix}

(15)

\begin{matrix} T_{θ} d θ & = d N \end{matrix}

(16)

Solving the above two equations simultaneously yields the following result:

T_{θ} = C e^{μ θ}

(17)

where C is the constant of integration.

Applying the boundary condition

θ = 0

, at which point the tension

T_{θ}

equals the slack-side tension

T_{A}

, leads to

C = T_{A}

. Substituting this into Equation (17) gives:

T_{θ} = T_{A} e^{μ θ}

(18)
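Equation (18) is the classical capstan relation. The short sketch below, with hypothetical function names and example values, evaluates the tension after a given wrap angle and the share of the output tension that is lost to friction, assuming a constant friction coefficient along the contact arc.

```python
import math


def capstan_tension(t_a, mu, wrap_angle):
    """Tension after wrapping through wrap_angle (rad), per Equation (18)."""
    return t_a * math.exp(mu * wrap_angle)


def friction_share(mu, wrap_angle):
    """Fraction of the tight-side tension attributable to friction: 1 - T_A / T_B."""
    return 1.0 - math.exp(-mu * wrap_angle)
```

For example, with an assumed friction coefficient of 0.2 and a quarter-turn wrap, the tension ratio is about 1.369, so a slack-side tension of 100 N corresponds to roughly 137 N on the tight side.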

The relationship between friction force and velocity for four friction models is shown in Figure 6. In practical systems, the nonlinear friction forces arising between the cable, pulley, and bell-shaped outlet significantly degrade the accuracy of motion control. To address this issue, the Stribeck dynamic friction model is introduced in this study. Building upon the conventional friction modeling of the mechanical structure, the Stribeck model incorporates dynamic nonlinear friction effects.

The mathematical formulation of the Stribeck model is expressed as follows:

F_{f} (v) = (F_{c} + (F_{s} - F_{c}) e^{- {(\frac{| v |}{v_{s}})}^{δ}}) \cdot sign (v) + σ v

(19)

where

F_{c}

is the Coulomb friction,

F_{s}

is the static friction, v is the velocity of the system,

v_{s}

is the Stribeck characteristic velocity,

δ

is the empirical coefficient of the Stribeck effect, and

σ

is the viscous friction coefficient.
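Equation (19) translates directly into code. The following minimal sketch uses a hypothetical function name and leaves all coefficients as arguments, since the identified values are not reproduced here. It follows the equation literally, so the friction force is zero at exactly zero velocity; handling of the stiction regime at rest is outside the scope of this sketch.

```python
import math


def stribeck_friction(v, f_c, f_s, v_s, delta, sigma):
    """Stribeck friction force of Equation (19).

    v: sliding velocity, f_c: Coulomb friction, f_s: static friction,
    v_s: Stribeck velocity, delta: empirical exponent, sigma: viscous coefficient.
    """
    if v == 0.0:
        return 0.0  # sign(0) = 0, as written in Equation (19)
    magnitude = f_c + (f_s - f_c) * math.exp(-((abs(v) / v_s) ** delta))
    return math.copysign(magnitude, v) + sigma * v
```

The force starts near the static level at very low speed, drops toward the Coulomb level as the speed passes the Stribeck velocity, and then grows linearly through the viscous term.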

4. Method

This section first provides a concise overview of reinforcement learning theory to establish the theoretical foundation. Subsequently, the architecture of the learning agent is detailed, comprising the definition of the state representation, the specification of the action space, and the formulation of a task-specific reward function. Building upon this, a Proximal Policy Optimization (PPO)-based algorithmic framework is proposed, specifically optimized to address the dynamic characteristics and control objectives of the arresting gear system.

4.1. Introduction of Deep Reinforcement Learning

Reinforcement Learning (RL) represents a data-driven approach where an autonomous agent optimizes its decision-making process through trial-and-error interactions with the environment (as depicted in Figure 7).

Formally, the control problem is modeled as a Markov Decision Process (MDP), defined by the tuple

〈 S, A, P, R, γ 〉

, representing the state space, action space, transition dynamics, reward function, and discount factor, respectively. At each discrete time step, the agent selects an action based on the current state, receives a scalar reward, and updates its policy. The ultimate objective is to learn an optimal policy that maximizes the expected cumulative return.

4.2. Agent Design

The formulation of the Reinforcement Learning (RL) agent involves the precise definition of the state space, action space, and the reward mechanism. These elements serve as the interface between the control algorithm and the arresting gear environment.

State

Considering the operational constraints and observability of the system, the state space is designed to capture the instantaneous kinematic status of the UUV. Let x denote the longitudinal displacement and v denote the velocity. The state vector

s_{t}

at time step t is defined as:

s_{t} = [x_{t}, v_{t}]

(20)

Action

The control objective is achieved by modulating the output force of the hydraulic drive system. The action space is continuous, where the action

a_{t}

represents the command force F applied to the arresting cable. The action vector is defined as:

a_{t} = [F_{t}]

(21)

Reward

The core of the learning mechanism lies in maximizing cumulative returns. Since the reward function serves as the explicit directive for policy updates, its structure is critical for convergence stability in complex physical simulation environments. To address the challenge of sparse rewards while strictly ensuring safe deceleration, we propose a hybrid reward architecture comprising a physics-informed base reward, a Potential-Based Reward Shaping (PBRS) mechanism, and a terminal success indicator.

Let x and v denote the current position and velocity of the system relative to the target origin. The composite reward function comprises three distinct terms:

\begin{matrix} r_{b a s e} & = - k_{1} | x | - k_{2} max (0, | v | - v_{l i m i t}) \end{matrix}

(22)

\begin{matrix} r_{p b r s} & = γ Φ (x_{t}) - Φ (x_{t - 1}) \end{matrix}

(23)

\begin{matrix} r_{f i n a l} & = \{\begin{matrix} k_{4}, & | x | < δ_{x} and | v | < δ_{v} \\ 0, & otherwise \end{matrix} \end{matrix}

(24)

(1) Physics-Informed Base Penalty (

r_{b a s e}

): This term penalizes positional deviations and unsafe speeds. The distance penalty

- k_{1} | x |

provides a continuous driving force toward the target. To ensure a smooth and collision-free deceleration, we introduce a dynamic velocity envelope

v_{l i m i t}

based on fundamental kinematics:

v_{l i m i t} = clip (\sqrt{2 a_{p} | x |}, v_{min}, v_{max})

(25)

where

a_{p}

is a predefined passive deceleration constant. The term

- k_{2} max (0, | v | - v_{l i m i t})

heavily penalizes the agent only when its speed exceeds this safe physical threshold, allowing the agent to maintain high operational speeds safely without triggering premature braking.

(2) Potential-Based Reward Shaping (

r_{p b r s}

): To mitigate the reward sparsity and local optima issues without altering the optimal policy, we incorporate the PBRS framework. We define a piecewise-linear potential function

Φ (x)

based on the initial maximum displacement

x_{max}

:

Φ (x) = k_{3} max (0, 1 - \frac{| x |}{x_{max}})

(26)

This potential is highest at the target and drops to zero at the starting threshold. The shaping reward

r_{p b r s} = γ Φ (x_{t}) - Φ (x_{t - 1})

provides dense, immediate feedback: progress toward the target raises the potential and is rewarded, whereas moving backward or oscillating is penalized.

(3) Terminal Completion (

r_{f i n a l}

): This sparse component rewards the agent only upon achieving rigorous capture conditions:

| x | < 0.1 m

and

| v | < 0.01 m / s

. This massive, unambiguous signal anchors the policy, ensuring the arresting system achieves a stable, near-zero velocity state precisely at the target node. All weighting coefficients are detailed in Table 1.

The final reward function at each time step is defined as follows:

r e w a r d = r_{b a s e} + r_{p b r s} + r_{f i n a l}

(27)

It is worth noting that the specific values of the kinematic boundary parameters are not arbitrarily selected, but are strictly derived from the physical constraints of the arresting system. Firstly, the upper velocity bound,

v_{max} = 0.05 m / s

, is determined by the maximum kinetic dissipation capacity of the system. Given the massive inertia of the target (450,000 kg) and the relatively limited maximum arresting force, exceeding this velocity envelope within the extremely short operational stroke (2.5 m) would make a physical overshoot inevitable, regardless of the RL control policy. Thus, restricting the agent within this envelope ensures kinematic feasibility. Secondly, the lower bound,

v_{min} = 0.006 m / s

, serves as a crucial anti-stalling tolerance. If unclipped, the theoretical limit

\sqrt{2 a_{p} | x |}

would approach zero near the target, so that even microscopic residual motion would be penalized and the agent would brake prematurely outside the capture zone. This threshold allows the system to glide smoothly into the final target node without being aggressively penalized. Finally, since standard PPO cannot naturally enforce strict physical bounds, we treat these kinematic limits as a pseudo-hard constraint by assigning a significantly larger weight to the excess velocity penalty (

k_{2} = 100.0

) compared to the positional penalty (

k_{1} = 5.0

). This heavily skewed ratio forces the agent to prioritize physical safety over reaching the target during the early stages of exploration, thereby guaranteeing a stable, collision-free convergence.
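The composite reward of Equations (22) to (27) can be sketched as follows. The weights for the position and excess-velocity penalties, the velocity bounds, and the capture thresholds are the values stated in the text. The potential weight, the terminal bonus, the passive deceleration constant, and the discount factor are placeholders assumed here for illustration, since their actual values are given in the paper's tables; the function names are likewise hypothetical.

```python
import math

# Values stated in the text.
K1, K2 = 5.0, 100.0             # position and excess-velocity penalty weights
V_MIN, V_MAX = 0.006, 0.05      # velocity envelope bounds, m/s
DELTA_X, DELTA_V = 0.1, 0.01    # capture thresholds, m and m/s
X_MAX = 2.5                     # initial maximum displacement, m

# Assumed placeholders (not taken from the paper).
K3, K4 = 10.0, 100.0            # potential weight and terminal bonus
A_P = 5e-4                      # passive deceleration constant, m/s^2
GAMMA = 0.99                    # discount factor


def velocity_limit(x):
    """Dynamic velocity envelope of Equation (25)."""
    return min(max(math.sqrt(2.0 * A_P * abs(x)), V_MIN), V_MAX)


def potential(x):
    """Potential function of Equation (26)."""
    return K3 * max(0.0, 1.0 - abs(x) / X_MAX)


def step_reward(x_prev, x, v):
    """Total reward of Equation (27) for the transition x_prev -> x at velocity v."""
    r_base = -K1 * abs(x) - K2 * max(0.0, abs(v) - velocity_limit(x))
    r_pbrs = GAMMA * potential(x) - potential(x_prev)
    r_final = K4 if (abs(x) < DELTA_X and abs(v) < DELTA_V) else 0.0
    return r_base + r_pbrs + r_final
```

With these placeholder values, a step that reduces the displacement scores higher than a step of equal size in the opposite direction, through both the base penalty and the shaping term.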

4.3. Algorithm Model

Proximal Policy Optimization (PPO) is a model-free, on-policy policy gradient method designed to balance learning efficiency against algorithmic stability. Unlike standard policy gradient approaches, PPO optimizes a clipped surrogate objective that limits the deviation between the updated and behavioral policies. Clipping the probability ratio removes the incentive for excessively large policy updates, which mitigates the performance collapse caused by oversized parameter steps. Although the clipped objective does not carry a formal monotonic-improvement guarantee, it permits the collected trajectories to be reused for multiple optimization epochs, thereby improving sample efficiency. Empirically, PPO offers a simpler and more computationally efficient alternative to Trust Region Policy Optimization (TRPO), making it well-suited for the continuous control requirements of the UUV arresting system.

In policy gradient methods, the advantage function serves as a critical metric for evaluating the relative merit of a given state–action pair compared to the average behavior under the current policy. It plays a central role in guiding gradient updates in reinforcement learning. To improve the flexibility and stability of advantage estimation, the Generalized Advantage Estimation (GAE) technique is employed. This method introduces an exponential decay factor

λ

across the temporal dimension, enabling a trade-off between bias and variance in the estimation process. The resulting advantage function is defined as follows:

A_{t} = \sum_{l = 0}^{\infty} {(γ λ)}^{l} δ_{t + l}

(28)

Specifically, the temporal-difference (TD) error is defined as:

δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})

(29)

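which quantifies the discrepancy between the current value estimate of the state and the one-step bootstrapped target formed from the observed reward. The weighting factor

λ \in [0, 1]

sets the bias–variance trade-off: λ = 0 reduces the advantage to the one-step TD error (low variance, higher bias), whereas λ = 1 recovers the discounted Monte Carlo return minus the value baseline (low bias, higher variance).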
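In practice, the infinite sum in Equation (28) is truncated at the end of a rollout and evaluated with a backward recursion. The sketch below is one common way to do this for a single trajectory; the function name and default values are assumptions, and terminal-state masking across episode boundaries is omitted for brevity.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t] and values[t] = V(s_t) cover t = 0..T-1; last_value = V(s_T)
    (use 0.0 if s_T is terminal). Returns a list of advantages A_t.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error, Eq. (29)
        gae = delta + gamma * lam * gae                      # recursion for Eq. (28)
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

Adding each advantage to the corresponding value estimate gives a return target that can be used to train the critic.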

The core principle of PPO lies in constraining the deviation between the new policy and the old policy during each optimization step to prevent excessively large updates that may destabilize training. This is achieved by introducing a clipped surrogate objective function, which limits the change in the probability ratio between the new and old policies. Specifically, given the probability ratio

r_{t} (θ) = \frac{π_{θ} (a ∣ s)}{π_{θ_{old}} (a ∣ s)}

, the clipped objective function is defined as:

L_{t}^{C L I P} (θ) = E_{t} [min (r_{t} (θ) A_{t}, c l i p (r_{t} (θ), 1 - ε, 1 + ε) A_{t})]

(30)

where

A_{t}

is the estimated advantage function, and

ε

is a small positive constant that determines the clipping range. A schematic illustration of the PPO-clipped objective is presented in Figure 8.

In the early stages of reinforcement learning, the agent typically lacks sufficient knowledge of the environment, which often leads to suboptimal policy behavior such as premature convergence or entrapment in local optima. To mitigate these issues, the incorporation of an entropy term into the objective function serves to encourage greater exploration of the state–action space. This enhanced exploration promotes improved policy generalization and contributes to increased training stability throughout the learning process. Therefore, the complete loss function governing the learning process can be expressed as follows:

L_{t}^{CLIP + VF + S} (θ) = E_{t} [L_{t}^{CLIP} (θ) - c_{1} L_{t}^{VF} (θ) + c_{2} S [π_{θ}] (s_{t})],

(31)

where S denotes the entropy term, and

L_{t}^{VF} = {(V_{θ} (s_{t}) - V_{t}^{t a r g e t})}^{2}

represents the value function approximation error of the critic part.
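The combined objective of Equation (31) can be illustrated on a small batch. The sketch below assumes a one-dimensional Gaussian policy with a state-independent log standard deviation, so that the entropy has a closed form; this policy form, the function name, and the default coefficients are assumptions for illustration and are not the paper's settings.

```python
import math


def ppo_objective(logp_new, logp_old, advantages, values, value_targets,
                  log_std, eps=0.2, c1=0.5, c2=0.01):
    """Batch estimate of L^{CLIP+VF+S} in Equation (31), to be maximized."""
    n = len(advantages)
    clip_sum = 0.0
    vf_sum = 0.0
    for lp, lp_old, adv, v, v_tgt in zip(logp_new, logp_old, advantages,
                                         values, value_targets):
        ratio = math.exp(lp - lp_old)                     # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip(r_t, 1-eps, 1+eps)
        clip_sum += min(ratio * adv, clipped * adv)       # Equation (30)
        vf_sum += (v - v_tgt) ** 2                        # squared value error
    # Entropy of a 1-D Gaussian: 0.5 * ln(2*pi*e) + ln(sigma).
    entropy = 0.5 * math.log(2.0 * math.pi * math.e) + log_std
    return clip_sum / n - c1 * vf_sum / n + c2 * entropy
```

When the advantage is positive, the clipped term stops rewarding ratio increases beyond the upper clipping bound; when the advantage is negative, the minimum keeps the unclipped and more pessimistic value for large ratios.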

The Proximal Policy Optimization (PPO) algorithm is structured within the actor–critic framework, which separates the learning process into two distinct but cooperative components: the actor and the critic. The actor is responsible for learning a parameterized policy

π_{θ} (a ∣ s)

, which maps observed states to a distribution over actions, while the critic learns an estimate of the state-value function

V (s)

, providing a baseline to reduce variance in policy gradient estimation. At each iteration, the agent interacts with the environment using its current policy to generate a batch of trajectories comprising sequences of states, actions, rewards, and next states. These trajectories are then used to estimate the advantage function via Generalized Advantage Estimation (GAE), which evaluates the relative quality of the actions taken. Subsequently, the policy parameters are updated by gradient ascent on the clipped surrogate objective, which keeps the updated policy from deviating excessively from the previous one. Because the clipping bounds each update, the same batch can be reused for several optimization epochs, thereby improving sample efficiency. In parallel, the value function (critic) is updated to approximate the expected return, which supports advantage estimation and stabilizes learning. This on-policy training loop is repeated over many iterations, with new data collected after each round of policy updates to ensure consistency with the current policy.

As illustrated in Figure 9, we have developed a simulation framework based on reinforcement learning. In this framework, the agent continuously interacts with the environment by receiving state observations and rewards. Through iterative training, it gradually learns an optimal policy that generates suitable actions to fulfill the specified control objectives.

An overview of the PPO algorithm workflow used in this study is given in Algorithm 1, and the detailed network configurations are summarized in Table 2.

Algorithm 1 Training Process of the WSR-E-PPO Algorithm

Require: Environment Arresting System, Hyperparameters (update epochs K, clipping $ϵ$, GAE $γ, λ$, entropy coeff c, batch episodes N)
Ensure: Optimized actor network parameters $θ$ and critic network parameters $ϕ$

1: Initialize actor network $π_{θ}$ and critic network $V_{ϕ}$ with random weights
2: for iteration $= 1, 2, \dots,$ Max_Iterations do
3:     Phase 1: Trajectory Collection (Multi-Process Rollout)
4:     Initialize empty batch memory $D$
5:     for parallel worker $i = 1$ to N do
6:         Reset environment and get initial state $s_{0}$
7:         while episode not terminated do
8:             Sample continuous action $a_{t} \sim π_{θ} (a_{t} | s_{t})$
9:             Execute $a_{t}$ for $Δ t$, observe real-time reward $r_{t}$ (including PBRS) and next state $s_{t + 1}$
10:            Store transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ into $D$
11:            $s_{t} \leftarrow s_{t + 1}$
12:        end while
13:    end for
14:    Phase 2: Advantage Estimation
15:    Compute TD target $R_{t} = r_{t} + γ V_{ϕ} (s_{t + 1})$ for all transitions in $D$
16:    Compute Generalized Advantage Estimation ${\hat{A}}_{t}$ based on $γ$ and $λ$
17:    Normalize advantages ${\hat{A}}_{t} \leftarrow ({\hat{A}}_{t} - μ_{\hat{A}}) / (σ_{\hat{A}} + 10^{-8})$
18:    Phase 3: Network Updates
19:    $θ_{old} \leftarrow θ$
20:    for epoch $= 1$ to K do
21:        Compute probability ratio $\mathrm{ratio} = \frac{π_{θ} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})}$
22:        Compute clipped surrogate loss $L_{actor} = min (\mathrm{ratio} \cdot {\hat{A}}_{t}, clip (\mathrm{ratio}, 1 - ϵ, 1 + ϵ) \cdot {\hat{A}}_{t})$
23:        Compute value loss $L_{critic} = {(V_{ϕ} (s_{t}) - R_{t})}^{2}$
24:        Compute entropy bonus $H (π_{θ})$
25:        Update actor parameters $θ$ by maximizing $L_{actor} + c H (π_{θ})$ via Adam
26:        Update critic parameters $ϕ$ by minimizing $L_{critic}$ via Adam
27:    end for
28:    Clear batch memory $D$
29: end for

5. Results

5.1. Simulation Results

To ensure the fidelity and engineering reliability of the simulation environment, critical physical parameters, particularly the highly nonlinear Stribeck friction coefficients and the hydraulic actuation constraints, were derived from a physical testbed via system identification. Although the raw sensor datasets are withheld for commercial reasons, the identified equivalent parameters presented herein have been calibrated against measured torque and encoder feedback data. This calibration ensures that the simulation environment captures the real-world dynamic characteristics of the arresting mechanism rather than relying on arbitrary assumptions. The detailed configuration of the UUV system is presented in Table 3. To improve the stability and convergence speed of the neural networks, the state and action variables are normalized: the real-world displacement and velocity are divided by their respective maximum thresholds and mapped into the dimensionless range [−1, 1] before being fed into the actor–critic networks. It should be noted that the 1450 N propulsion force denotes the steady-state thrust required to overcome hydrodynamic drag and maintain a constant relative velocity difference of 0.25 m/s during the towing phase, rather than the maximum acceleration capability of the UUV.
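The normalization step described above can be sketched as follows. The scaling thresholds used as defaults (2.5 m for displacement and 0.05 m/s for velocity) are assumptions based on the stroke and the upper velocity bound quoted earlier, and may differ from the thresholds actually used; the function names and the force scale argument are hypothetical.

```python
def normalize(value, max_abs):
    """Scale a physical quantity by its maximum magnitude and clamp to [-1, 1]."""
    return max(-1.0, min(1.0, value / max_abs))


def normalize_state(x, v, x_scale=2.5, v_scale=0.05):
    """Map displacement (m) and velocity (m/s) to the network input range."""
    return (normalize(x, x_scale), normalize(v, v_scale))


def denormalize_action(a_norm, f_max):
    """Map a network output in [-1, 1] back to a command force with limit f_max."""
    return max(-1.0, min(1.0, a_norm)) * f_max
```

Clamping keeps the network inputs bounded even when a disturbance pushes the physical state slightly outside its nominal range.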

The key hyperparameters used for training the PPO algorithm are listed in Table 4.

The learning curve of the proposed WSR-E-PPO algorithm over 4200 training episodes is illustrated in Figure 10. The episodic return exhibits a distinct three-phase evolutionary process. In the initial phase (episodes 0 to 1500), the cumulative return climbs rapidly from approximately −14,000. This steep upward trend indicates that the Potential-Based Reward Shaping (PBRS) mechanism effectively mitigates the sparse reward problem, providing dense gradient guidance that helps the agent quickly avoid severe penalty states. During the intermediate phase (episodes 1500 to 2500), the learning curve displays notable fluctuations and a temporary performance dip. Rather than algorithmic instability, this behavior highlights the active role of the entropy regularization term, which encourages the agent to aggressively explore the state–action space to escape local optima induced by the highly nonlinear Stribeck friction. Finally, in the convergence phase (beyond episode 3000), the episodic return steadily ascends and stabilizes around a mean value of −8500. The persistent but bounded variance at the end of training is characteristic of stochastic policies in continuous control, demonstrating that the WSR-E-PPO agent has successfully converged to a robust and near-optimal recovery policy.

The kinematic trajectories of the UUV during the recovery process, specifically position and velocity, are illustrated in Figure 11. Driven by the proposed WSR-E-PPO policy, the UUV exhibits a smooth and monotonic approach from the initial displacement of 2.5 m to the target docking position (0.0 m). Concurrently, the velocity profile is smooth and chatter-free: the UUV accelerates toward the docking point, reaching a peak approach velocity of approximately −0.042 m/s, and then gradually decelerates to zero. Notably, the entire arresting phase is completed in a single, continuous motion without overshoot, steady-state error, or high-frequency oscillations. This kinematic response indicates that the algorithm can deliver a safe, stable, one-pass recovery maneuver despite the severe nonlinear disturbance introduced by Stribeck friction.

The corresponding control force trajectory generated by the WSR-E-PPO policy during the recovery phase is illustrated in Figure 12. As depicted, the arresting force exhibits a smooth and continuous profile, starting at a peak of approximately 800 N to provide the necessary initial deceleration and then decreasing gradually. Crucially, the entire force curve is free of abrupt jumps, high-frequency spikes, and oscillatory chattering. In the context of UUV arresting systems, this smooth force modulation is of paramount importance: it prevents severe mechanical wear, mitigates snap-loading risks on the recovery cable, and avoids dangerous pressure surges within the hydraulic transmission. This force regulation further underscores the robustness of the WSR-E-PPO algorithm, indicating that it can deliver precise, stable, and hardware-friendly actuation commands even when subjected to strong nonlinear Stribeck friction.

Figure 13 depicts the force transmission ratio (arresting cable tension versus drive output) achieved by the WSR-E-PPO method. The operating environment is subject to severe frictional coupling, with friction accounting for nearly 50% of the total cable load. Because the friction is governed by the Stribeck formulation, the resistive force is strongly nonlinear in the system velocity, which poses a challenge for precise tension control.

The temporal evolution of the hydraulic valve control current, governed by the WSR-E-PPO controller, is depicted in Figure 14. The profile demonstrates a smooth and continuous dynamic response. Upon initiation, the control current exhibits a fast transient response, ramping up from

0.51 A

to a peak of

0.85 A

to rapidly generate the necessary braking torque for kinetic energy dissipation. Crucially, the signal trajectory is free from high-frequency chattering or abrupt fluctuations, attesting to the robustness of the WSR-E-PPO policy. This smooth modulation is essential for hydraulic systems, as it prevents valve wear and pressure surges. In summary, the algorithm delivers precise, fine-grained actuation commands, guaranteeing the operational stability and safety of the recovery mission.

Comprehensive simulation trials validate that Deep Reinforcement Learning (DRL) paradigms offer substantial improvements in dynamic response and operational efficiency compared to conventional control strategies. Notably, the proposed WSR-E-PPO algorithm outperforms all comparative benchmarks, delivering optimal performance across multiple evaluation criteria. It achieves rapid policy convergence with the highest asymptotic rewards, indicating superior learning efficiency. The resulting kinematic profiles are smooth and monotonic, ensuring precise docking with negligible overshoot, while the force output demonstrates excellent damping characteristics, thereby minimizing mechanical wear. Moreover, the method exhibits remarkable resilience against nonlinear environmental disturbances, including Stribeck friction and complex hydrodynamic drag. By enabling rapid and stable recovery under these challenging conditions, WSR-E-PPO demonstrates exceptional robustness and practical viability for deployment in complex underwater operational scenarios.

5.2. Implementation and Statistical Reproducibility

To ensure reproducibility and transparency, the simulation and algorithmic deployment were strictly controlled. The software environment was built on Python 3.10. The hardware platform comprised an AMD Ryzen 9 9950X3D processor and an NVIDIA RTX 4080 Super GPU. To accelerate the collection of environmental transitions, a parallel multi-processing architecture was implemented, deploying 30 asynchronous workers to gather rollout batches across 4200 training episodes. Within each run, all hardware-level random operations were fixed via deterministic seeds. To validate the robustness of the proposed WSR-E-PPO algorithm against seed-dependent stochasticity, statistical variance evaluations were conducted: the algorithm was trained independently with multiple distinct random seeds. As illustrated in the reward convergence plots (Figure 15), the solid line represents the mean return across all seeds, while the shaded region denotes the

\pm 1

standard deviation. The significantly tightened variance confirms that the proposed reward topology drastically suppresses statistical instability, ensuring robust and reproducible convergence across diverse initialization conditions.
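The mean curve and the one-standard-deviation band described above can be computed per episode from the per-seed learning curves. The following is a minimal sketch with a hypothetical function name; it assumes that all curves have the same length and uses the sample standard deviation, which may differ from the estimator used for Figure 15.

```python
import statistics


def mean_std_band(returns_by_seed):
    """Per-episode mean and +/- 1 standard deviation band across seeds.

    returns_by_seed: list of equally long return curves, one per random seed.
    Returns three lists: (mean, lower, upper).
    """
    n_episodes = len(returns_by_seed[0])
    mean, lower, upper = [], [], []
    for ep in range(n_episodes):
        samples = [curve[ep] for curve in returns_by_seed]
        m = statistics.fmean(samples)
        s = statistics.stdev(samples) if len(samples) > 1 else 0.0
        mean.append(m)
        lower.append(m - s)
        upper.append(m + s)
    return mean, lower, upper
```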

5.3. Robustness Evaluation Under Uncertainties

In real-world marine engineering, unmodeled environmental disturbances, hardware parameter drifts, and stochastic ocean wave impacts are inevitable. To rigorously evaluate the generalization capability and safety boundaries of the trained WSR-E-PPO policy, a series of Monte Carlo-based robustness tests were conducted. The evaluation encompasses three critical scenarios: variations in initial positions, fluctuations in initial velocities, and compounded parameter uncertainties. The corresponding physical response trajectories, encompassing both displacement and velocity profiles, are illustrated in Figure 16.

Robustness to Initial Position Variations: In practical retrieval operations, the UUV may not be halted at the exact theoretical maximum run-out distance. To test spatial robustness, the initial position

x_{0}

was uniformly randomized with a perturbation of

\pm 0.2

m around the nominal

2.5

m starting point. As shown in the displacement and velocity trajectories (Figure 16a,b), the WSR-E-PPO agent demonstrated exceptional spatial adaptability. It smoothly absorbed the initial state deviations and guided the system to the origin with a 100% success rate, strictly maintaining the velocity within the safety constraints throughout all episodes.

Robustness to Initial Velocity Perturbations: To simulate the effects of sudden ocean wave impacts occurring exactly as the towing phase initiates, the initial velocity

v_{0}

was perturbed within a range of

\pm 0.02

m/s around the nominal 0 m/s stationary state. Despite the abrupt injection of unexpected kinetic energy, the trained agent exhibited rapid stabilization, successfully arresting the transient velocity spikes and completing the retrieval with a 90% success rate (Figure 16c,d). The 10% failure instances occurred exclusively when extreme wave disturbances pushed the instantaneous velocity slightly beyond the strict safety threshold, demonstrating that the agent’s failure mode is a physical constraint violation rather than a divergent algorithmic collapse.

Robustness to Compound Parameter and Environmental Disturbances: The most rigorous stress-test involves simultaneous dynamic uncertainties. In this scenario, random continuous noise up to

\pm 10 %

was simultaneously injected into the Stribeck friction coefficients, the hydrodynamic drag parameters, and the system’s baseline constants. Despite the severity of these compounded, unmodeled, and strongly nonlinear perturbations, the proposed algorithm maintained a 90% success rate (Figure 16e,f). Analysis of the trajectories indicates that even in the 10% of edge cases where worst-case combinations of maximum friction and peak hydrodynamic resistance coincided, the WSR-E-PPO agent avoided limit-cycle oscillations and exhibited safe, graceful degradation. These results indicate that the proposed reward shaping paradigm enables the agent to learn a policy that generalizes across perturbed nonlinear underwater friction environments.
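A Monte Carlo evaluation of this kind can be organized as in the sketch below. The perturbation ranges match those stated above, but the paper applies them in three separate test scenarios, whereas this sketch samples them together for brevity. Uniform sampling of the velocity and parameter perturbations is an assumption, and the `run_episode` callback, which would wrap the simulator and the trained policy and return whether the recovery succeeded, is hypothetical.

```python
import random


def sample_scenario(rng, x0_nominal=2.5, dx=0.2, dv=0.02, rel_noise=0.10):
    """Draw one perturbed test scenario (uniform sampling is assumed)."""
    return {
        "x0": x0_nominal + rng.uniform(-dx, dx),                   # initial position, m
        "v0": rng.uniform(-dv, dv),                                # initial velocity, m/s
        "friction_scale": 1.0 + rng.uniform(-rel_noise, rel_noise),
        "drag_scale": 1.0 + rng.uniform(-rel_noise, rel_noise),
    }


def success_rate(run_episode, n_trials, seed=0):
    """Fraction of sampled scenarios for which run_episode(scenario) is truthy."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n_trials) if run_episode(sample_scenario(rng)))
    return successes / n_trials
```

Fixing the seed of the scenario sampler makes the set of test cases repeatable, so that different policies can be compared on identical perturbations.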

5.4. Ablation Study

To validate the individual contributions of the core mechanisms within the proposed framework, an algorithmic ablation study was conducted. The ablation focused on the two primary enhancements: Potential-Based Reward Shaping (PBRS) and Entropy Regularization. Specifically, three degraded variants were tested under identical physical parameters and initial conditions: (1) WSR-E-PPO without PBRS (relying solely on basic sparse rewards), (2) WSR-E-PPO without Entropy Regularization (with the entropy bonus removed from the objective), and (3) Vanilla PPO (with neither mechanism). The following subsections detail the performance degradation observed in each variant.

5.4.1. Ablation of PBRS

The performance of the agent trained without the PBRS mechanism is illustrated in Figure 17. As shown in the learning curve (Figure 17a), although the raw episodic return eventually plateaus, it converges to a highly suboptimal local optimum (around −6000) accompanied by extreme variance. This indicates a severe sparse reward pathology.

Without the dense, monotonic gradient guidance provided by the potential field, the agent fails to comprehend the ultimate goal of the recovery mission. This failure is starkly reflected in the kinematic and dynamic responses (Figure 17b,c). The control force output (Figure 17d) collapses to an ineffectively low level (peaking at only about 27.5 N), which is far too low to overcome the 1450 N steady-state propulsion and the severe Stribeck friction. Consequently, instead of being reeled in, the UUV drifts away from the target docking station, with its position diverging beyond 6.0 m and its velocity remaining continuously positive.

This catastrophic failure confirms that under severe nonlinear friction and hydrodynamic resistance, traditional sparse rewards are entirely inadequate. The PBRS module is therefore indispensable for bootstrapping the agent out of the initial dead zone and guiding it toward the target.

5.4.2. Ablation of Entropy Regularization

To evaluate the contribution of the exploration mechanism, the second variant removed the entropy regularization term from the PPO objective function (Figure 18). As depicted in the learning curve (Figure 18a), the episodic return of the “w/o Entropy” variant climbs initially but prematurely plateaus at approximately −8000. Without the entropy bonus to penalize overly deterministic action distributions, the agent’s exploration variance decays too rapidly during the early training phase. This premature convergence is evident in the dynamic and kinematic responses (Figure 18b,c). The agent learns to output a maximum control force of only 102.5 N (Figure 18d). While this is higher than the roughly 27.5 N peak of the “w/o PBRS” variant, it remains vastly insufficient to punch through the friction dead zone. Consequently, the UUV still fails to be recovered, slowly drifting away from the docking station.

This ablation experiment clearly demonstrates that dense reward shaping (PBRS) alone is insufficient when dealing with heavy physical thresholds. The entropy regularization mechanism is strictly required to encourage the agent to aggressively explore the high-force action space, thereby discovering the optimal tension strategy needed for a successful recovery.
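How the entropy term counteracts a collapsing exploration variance can be illustrated with a one-sample sketch of the clipped PPO objective plus an entropy bonus for a scalar Gaussian policy. The function names, the clipping threshold, and the coefficient values below are assumptions for illustration, not the authors' settings.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian policy with standard deviation std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(ratio, advantage, std, entropy_coef, eps=0.2):
    """Objective to maximize: clipped surrogate plus weighted entropy bonus."""
    return clipped_surrogate(ratio, advantage, eps) + entropy_coef * gaussian_entropy(std)


if __name__ == "__main__":
    # A narrower action distribution has lower entropy, so it earns a smaller bonus.
    for std in (1.0, 0.5, 0.1):
        print(std, gaussian_entropy(std), ppo_objective(1.0, 1.0, std, entropy_coef=0.01))
```

Setting `entropy_coef` to zero recovers the objective of the “w/o Entropy” variant, in which nothing resists a shrinking standard deviation.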

5.4.3. Ablation of Both Mechanisms

Finally, to establish a baseline of complete degradation, the third variant removed both the PBRS and entropy regularization mechanisms, effectively reverting the agent to a standard Vanilla PPO architecture (Figure 19). The results unequivocally demonstrate that standard DRL algorithms are fundamentally unequipped to handle the heavy constraints of this specific marine engineering task. As shown in the learning curve (Figure 19a), the Vanilla PPO agent converges almost immediately to a severely suboptimal local minimum (around −6000), exhibiting a lack of meaningful learning progress. Suffering from the combined pathology of sparse gradient guidance and collapsed exploration variance, the agent completely fails to grasp the recovery objective. Consequently, the control force output (Figure 19d) remains consistently negligible, peaking at a mere 32 N. This weak actuation is entirely consumed by the environmental resistance, causing the UUV to be swept away from the docking station (Figure 19b,c). The position diverges beyond 6.0 m, and the velocity exhibits a continuous, uncontrolled positive drift. This catastrophic failure of the Vanilla PPO agent perfectly closes the logical loop of the ablation study: it proves that the severe nonlinearities of Stribeck friction and hydrodynamic drag cannot be overcome by standard policy optimization alone. The simultaneous integration of Potential-Based Reward Shaping and Entropy Regularization within the proposed WSR-E-PPO framework is strictly necessary to achieve a successful and stable UUV recovery.

5.5. Comparative Evaluation

To comprehensively evaluate the performance, robustness, and specific algorithmic advantages of the proposed WSR-E-PPO algorithm, four representative control strategies were selected as comparison benchmarks. These baselines were carefully chosen to form a systematic validation hierarchy, covering traditional industrial practices, domain-specific mainstream methods, state-of-the-art reinforcement learning, and an algorithmic ablation study:

Constant Tension Control: This represents the fundamental, passive winch mechanism widely utilized as a fail-safe in real-world marine engineering. It serves as the absolute lower bound for tracking performance and energy efficiency in our comparative analysis.
Kinetic Energy Decay Trajectory + PID [19]: This strategy represents the current mainstream model-based approach in the specific domain of UUV retrieval. It plans a velocity trajectory based on kinetic energy dissipation and tracks it using a traditional PID controller. Including this benchmark rigorously demonstrates the limitations of linear feedback controllers and static mathematical models when subjected to severe, unpredictable Stribeck friction dynamics.
Soft Actor–Critic (SAC): SAC is selected as the state-of-the-art (SOTA) standard Deep Reinforcement Learning benchmark for continuous control tasks. Comparing the proposed method against standard SAC proves that our customized control architecture is highly competitive and specifically better suited for this heavily constrained, friction-dominated towing environment.
Vanilla PPO (Ablation Benchmark): To scientifically validate the core contributions of this paper, a degraded version of our algorithm, Vanilla PPO without Entropy Regularization or Potential-Based Reward Shaping (PBRS), is included as an ablation benchmark. This specifically demonstrates that without our customized reward shaping and exploration mechanisms, standard DRL agents frequently suffer from exploration inefficiency and become trapped in friction-induced dead zones or suboptimal local minima.

The comprehensive performance of the representative control strategies is systematically evaluated across four critical dimensions: returns, displacement tracking, velocity regulation, and control force stability.

To further elucidate the algorithmic advantages from a training perspective, the episodic learning curves of the data-driven agents are compared in Figure 20. This specifically includes an ablation study against a PPO baseline, which removes the Potential-Based Reward Shaping (PBRS) and entropy regularization mechanisms. As depicted in the training trajectories, the SAC agent exhibits a rapid initial surge but prematurely flatlines at a suboptimal plateau. This mathematically corroborates its physical failure shown in Figure 21, where the agent learns to “hack” the base reward by halting in the friction dead zone to avoid further dynamic movement penalties. The ablated Vanilla PPO agent demonstrates severe training instability, characterized by high variance and violent fluctuations. Without the customized exploration and dense shaping signals, it struggles to robustly navigate the sparse and deceptive reward landscape caused by the severe nonlinear friction. In contrast, the proposed WSR-E-PPO algorithm displays a fundamentally healthier, continuous learning trajectory. Guided by the PBRS mechanism, the agent successfully bypasses the deceptive local optima, maintaining a steady and stable gradient update toward the true physical objective. While the absolute numerical bounds of the Y-axis differ structurally due to the reshaped PBRS topology, the relative convergence trend explicitly proves that the integration of entropy regularization and potential-based shaping is entirely indispensable for stabilizing the learning process and guaranteeing task completion.

As illustrated in Figure 21, the simulation results clearly demonstrate the vulnerability of traditional methods in friction-dominated environments. The Constant Force strategy fails entirely, inducing severe, unattenuated oscillations that repeatedly overshoot the target, proving that passive constant tension is incapable of handling dynamic retrieval. The KEDT+PID strategy avoids extreme overshooting but suffers from an excessively sluggish convergence rate, struggling to reach the target even after 120 s. This reveals the inability of static trajectory planning to dynamically adapt to Stribeck friction variations. Furthermore, the standard SAC agent exhibits premature convergence; it approaches the 0.2 m mark but then drifts away, clearly becoming trapped in a friction-induced dead zone (a suboptimal local minimum). In stark contrast, the proposed WSR-E-PPO algorithm demonstrates impeccable spatial tracking, smoothly and monotonically driving the UUV from the 2.5 m initial position to exactly 0.0 m within 80 s, entirely eliminating steady-state errors.
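The friction-induced dead zone referred to above can be made concrete with a generic Stribeck curve. The sketch below uses a common textbook form (Coulomb level, static peak, exponential Stribeck decay, viscous term); the functional form and all parameter values are illustrative assumptions and may differ from the friction model defined earlier in this paper.

```python
import math


def stribeck_friction(v, f_c, f_s, v_s, sigma):
    """Signed friction force at sliding velocity v for a generic Stribeck curve.

    f_c: Coulomb friction, f_s: static (breakaway) friction,
    v_s: Stribeck velocity, sigma: viscous coefficient.
    """
    if v == 0.0:
        return 0.0  # sticking is handled separately by breaks_away()
    magnitude = f_c + (f_s - f_c) * math.exp(-(v / v_s) ** 2) + sigma * abs(v)
    return math.copysign(magnitude, v)


def breaks_away(net_applied_force, f_s):
    """A body at rest starts moving only if the applied force exceeds f_s."""
    return abs(net_applied_force) > f_s


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the shape of the curve.
    params = dict(f_c=100.0, f_s=150.0, v_s=0.05, sigma=200.0)
    for v in (0.001, 0.1, 0.5):
        print(v, stribeck_friction(v, **params))
    print(breaks_away(120.0, params["f_s"]))  # below the static peak: stays stuck
```

Any commanded force below the static peak leaves the system stuck, which is the dead zone in which an under-exploring policy can settle.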

A safe UUV retrieval rigorously demands bounded, smooth velocity profiles to prevent structural damage from wave impacts or cable snapping. As shown in Figure 22, the Constant Force baseline violates all safety constraints, generating violent velocity fluctuations between −0.15 m/s and +0.14 m/s. The SAC agent fails to maintain a unidirectional retrieval, crossing the zero-velocity line and moving in reverse, corroborating its spatial drift. While KEDT+PID maintains a low velocity, its profile is overly conservative. Conversely, WSR-E-PPO generates a highly optimal, parabolic-like velocity trajectory. It safely dips to a controlled retrieval speed of approximately −0.045 m/s and then gracefully decelerates to an absolute standstill (0.0 m/s) exactly as the target is reached, showcasing exceptional kinematic safety.
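The two qualitative criteria used in this comparison, unidirectional motion and bounded speed, can be checked numerically on a sampled velocity trace. The sketch below uses synthetic profiles, not the simulation data behind Figure 22; the helper names and the sign convention (negative velocity means motion toward the dock) are assumptions.

```python
def is_unidirectional(velocities, tol=1e-6):
    """True if the velocity never exceeds +tol, i.e., no reverse (outbound) motion.

    Convention assumed here: negative velocity means motion toward the dock.
    """
    return all(v <= tol for v in velocities)


def peak_speed(velocities):
    """Largest absolute velocity in the trace."""
    return max(abs(v) for v in velocities)


def parabolic_profile(v_peak, n):
    """Synthetic profile: zero at both ends, -v_peak at mid-course."""
    profile = []
    for k in range(n + 1):
        s = k / n
        profile.append(-v_peak * 4.0 * s * (1.0 - s))
    return profile


if __name__ == "__main__":
    smooth = parabolic_profile(0.045, 80)         # parabolic-like retrieval
    oscillating = [-0.15, 0.14, -0.15, 0.14]      # sign-changing trace
    print(is_unidirectional(smooth), peak_speed(smooth))
    print(is_unidirectional(oscillating), peak_speed(oscillating))
```

A trace that changes sign fails the first check regardless of its magnitude, while the second check bounds the retrieval speed.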

The fundamental superiority of the proposed method is most distinctly highlighted in the commanded force profiles (Figure 23). The KEDT+PID controller exhibits severe chattering: the command force violently oscillates, repeatedly dropping to 0 N (which would cause catastrophic cable slack in reality) and then spiking above 1200 N. This is a textbook symptom of linear feedback controllers failing to compensate for nonlinear stick-slip transitions. The SAC agent requires high initial energy and fails to stabilize a steady pulling force. However, the WSR-E-PPO algorithm delivers an incredibly smooth, chattering-free tension command. After an initial peak (∼800 N) required to overcome the massive peak static friction, it perfectly adapts to the hydrodynamic drag, maintaining a steady and optimal tension (around 150–200 N). This smooth output strictly prevents actuator fatigue and ensures the longevity of the hydraulic winch system.
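Chattering and cable slack can be quantified with two simple measures on a sampled force command: the total variation of the signal and the number of samples at or below a slack threshold. The sketch below is one possible way to do this; the metric choice, helper names, and the synthetic signals are assumptions and do not reproduce the data of Figure 23.

```python
def total_variation(signal):
    """Sum of absolute sample-to-sample changes; large values indicate chattering."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


def slack_samples(force, threshold=0.0):
    """Number of samples in which the commanded tension is at or below threshold."""
    return sum(1 for f in force if f <= threshold)


if __name__ == "__main__":
    # Synthetic commands in newtons, for illustration only.
    chattering = [0.0, 1200.0, 0.0, 1200.0, 0.0]
    smooth = [800.0, 600.0, 400.0, 200.0, 175.0]
    print(total_variation(chattering), slack_samples(chattering))
    print(total_variation(smooth), slack_samples(smooth))
```

A command that alternates between zero and a large value accumulates far more variation than one that decays once from its initial peak, even when both span a similar force range.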

In summary, the comparative evaluations validate the theoretical claims of this study. Traditional model-based linear controllers (KEDT+PID) suffer from severe model mismatches and actuator chattering under violent Stribeck friction dynamics. Meanwhile, standard state-of-the-art continuous DRL agents (SAC) frequently suffer from exploration inefficiency, becoming trapped in friction-induced dead zones.

By integrating Potential-Based Reward Shaping (PBRS) with entropy-regularized exploration, the proposed WSR-E-PPO effectively overcomes these fundamental bottlenecks. It smoothly navigates the severe “low force-to-friction ratio” boundary conditions, successfully extracting the heavily nonlinear UUV system from maximum run-out to the docking origin. The proposed strategy not only ensures highly reliable spatial convergence with zero steady-state error but also strictly satisfies operational safety constraints through a chattering-free control profile, proving its immense potential for reliable deployment in complex, heavy-duty marine engineering applications.

6. Conclusions

The successful application of a deep reinforcement learning-based tension control strategy constitutes a pivotal advancement in the domain of autonomous underwater recovery. By effectively addressing the intrinsic bottlenecks of conventional control paradigms, this research demonstrates the strategic viability of integrating advanced AI techniques into marine robotics. Ultimately, this work provides a critical technical solution for ensuring the safe, efficient, and reliable dynamic retrieval of large-tonnage UUVs in complex operational environments.

To overcome the critical challenges imposed by the high inertia of large-tonnage UUVs and the stochastic nature of underwater recovery, this paper proposes a robust tension control strategy utilizing a Well-Shaped Reward Entropy-regularized PPO (WSR-E-PPO) algorithm applied to a hydraulic arresting system. The study establishes a high-fidelity simulation environment that rigorously incorporates nonlinear Stribeck friction dynamics to validate system performance under realistic conditions. By augmenting the standard PPO framework with entropy regularization and domain-informed reward shaping, the proposed method effectively mitigates the impact of complex mechanical and environmental nonlinearities. Simulation results confirm that WSR-E-PPO resolves the precision docking problem, guaranteeing both operational safety and high recovery efficiency. Benchmarking against traditional control strategies reveals that the proposed DRL approach obviates the need for heuristic PID parameter tuning and significantly curtails recovery time. Moreover, the algorithm yields superior force modulation characteristics, characterized by smoother actuation profiles for both the drive system and the arresting cable, thereby minimizing mechanical stress and enhancing system stability.

While the WSR-E-PPO algorithm exhibits superior efficacy and efficiency under significant nonlinear disturbances, successful real-world deployment requires addressing certain practical challenges and simplifying assumptions.

(1) Impact Geometry: The present study focuses on the ideal case of longitudinal impact within a constrained spatial window. Future investigations must extend to oblique engagement scenarios, where drift angles (sideslip) and lateral offsets induced by cross-currents introduce complex six-degrees-of-freedom (6-DOF) coupling effects. Analyzing the system’s resilience under these dynamic impact conditions is critical for operational validation.

(2) Environmental Fidelity and Sim-to-Real Transfer: Although the Stribeck friction model provides a rigorous basis for simulation, actual environments involve unmodeled hydrodynamic disturbances and highly stochastic uncertainties. To mitigate the Sim-to-Real gap, future work will explore integrating robust optimization frameworks with data-driven learning models (such as Domain Randomization or Digital Twins). These advancements will further enhance the adaptability of deep reinforcement learning agents, ensuring reliable recovery performance in unpredictable marine environments.

Author Contributions

Conceptualization, F.L. and X.W.; methodology, X.W. and F.L.; software, X.W., W.L. and J.H.; validation, X.W., W.L. and J.H.; formal analysis, X.W.; investigation, X.W. and W.L.; resources, F.L.; data curation, X.W. and J.H.; writing—original draft preparation, X.W.; writing—review and editing, F.L., X.W., W.L. and J.H.; visualization, X.W. and W.L.; supervision, F.L.; project administration, F.L.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Repoulias, F.; Papadopoulos, E. Planar trajectory planning and tracking control design for underactuated AUVs. Ocean Eng. 2007, 34, 1650–1667.
Sariel, S.; Balch, T.; Erdogan, N. Naval mine countermeasure missions. IEEE Robot. Autom. Mag. 2008, 15, 45–52.
Yan, Z.; Yang, Z.; Pan, X.; Zhou, J.; Wu, D. Virtual leader based path tracking control for Multi-UUV considering sampled-data delays and packet losses. Ocean Eng. 2020, 216, 108065.
Yuh, J. Design and control of autonomous underwater robots: A survey. Auton. Robot. 2000, 8, 7–24.
Gong, P.; Yan, Z.; Zhang, W.; Tang, J. Lyapunov-based model predictive control trajectory tracking for an autonomous underwater vehicle with external disturbances. Ocean Eng. 2021, 232, 109010.
Palomeras, N.; Vallicrosa, G.; Mallios, A.; Bosch, J.; Vidal, E.; Hurtos, N.; Carreras, M.; Ridao, P. AUV homing and docking for remote operations. Ocean Eng. 2018, 154, 106–120.
Zhang, W.; Zeng, J.; Yan, Z.; Wei, S.; Tian, W. Leader-following consensus of discrete-time multi-AUV recovery system with time-varying delay. Ocean Eng. 2021, 219, 108258.
Jun, B.H.; Park, J.Y.; Lee, F.Y.; Lee, P.M.; Lee, C.M.; Kim, K.; Lim, Y.K.; Oh, J.H. Development of the AUV ‘ISiMI’ and a free running test in an Ocean Engineering Basin. Ocean Eng. 2009, 36, 2–14.
Sato, Y.; Maki, T.; Masuda, K.; Matsuda, T.; Sakamaki, T. Autonomous docking of hovering type AUV to seafloor charging station based on acoustic and visual sensing. In Proceedings of the 2017 IEEE Underwater Technology (UT); IEEE: Piscataway, NJ, USA, 2017; pp. 1–6.
Walton, J. AUV launch and recovery from US navy ships: Problems and solutions. In Proceedings of the PACON, San Francisco, CA, USA, 8–11 July 2001; PACON International: Honolulu, HI, USA, 2001; pp. 324–331.
Fan, S.; Liu, C.; Li, B.; Xu, Y.; Xu, W. AUV docking based on USBL navigation and vision guidance. J. Mar. Sci. Technol. 2019, 24, 673–685.
Page, B.R.; Mahmoudian, N. Simulation-driven optimization of underwater docking station design. IEEE J. Ocean. Eng. 2019, 45, 404–413.
Zhang, W.; Jia, G.; Wu, P.; Yang, S.; Huang, B.; Wu, D. Study on hydrodynamic characteristics of AUV launch process from a launch tube. Ocean Eng. 2021, 232, 109171.
Zhang, W.; Teng, Y.; Chen, H.; Yu, C. On the robust model predictive control method of dynamic positioning to line for UUV recovery. In Proceedings of the OCEANS 2016 MTS/IEEE Monterey; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6.
Hardy, T.; Barlow, G. Unmanned Underwater Vehicle (UUV) deployment and retrieval considerations for submarines. In Proceedings of the International Naval Engineering Conference and Exhibition; OODA Technologies Inc.: Montreal, QC, Canada, 2008; Volume 2008.
Li, Y.; Jiang, Y.; Cao, J.; Wang, B.; Li, Y. AUV docking experiments based on vision positioning using two cameras. Ocean Eng. 2015, 110, 163–173.
Kim, J.; Lee, G. A study on the UUV docking system by using torpedo tubes. In Proceedings of the 2011 8th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI); IEEE: Piscataway, NJ, USA, 2011; pp. 842–844.
Bai, G.; Gu, H.; Zhang, H.; Meng, L.; Tang, D. V-shaped wing design and hydrodynamic analysis based on moving base for recovery AUV. In Proceedings of the 2018 WRC Symposium on Advanced Robotics and Automation (WRC SARA); IEEE: Piscataway, NJ, USA, 2018; pp. 320–325.
Wang, X.; Liang, L.; Lei, M.; Huang, J. Research on Variable Tension Control Strategy for UUV Arresting Gear System. In Proceedings of the ISOPE International Ocean and Polar Engineering Conference; ISOPE: Mountain View, CA, USA, 2024; p. ISOPE-I-24-245.
Shang, W.; Cong, S.; Zhang, Y. Nonlinear friction compensation of a 2-DOF planar parallel manipulator. Mechatronics 2008, 18, 340–346.
Li, B.; Xie, X.; Yu, B.; Liao, Y.; Fan, D. Data-driven friction modeling and compensation for rotary servo actuators. Front. Mech. Eng. 2024, 19, 41.
Kim, M.J.; Beck, F.; Ott, C.; Albu-Schäffer, A. Model-free friction observers for flexible joint robots with torque measurements. IEEE Trans. Robot. 2019, 35, 1508–1515.
Lischinsky, P.; Canudas-de Wit, C.; Morel, G. Friction compensation for an industrial hydraulic robot. IEEE Control Syst. Mag. 2002, 19, 25–32.
Liu, Y.; Alambeigi, F. Impact of generic tendon routing on tension loss of tendon-driven continuum manipulators with planar deformation. IEEE Robot. Autom. Lett. 2022, 7, 3624–3631.
Kim, Y.H.; Lewis, F.L. Reinforcement adaptive learning neural-net-based friction compensation control for high speed and precision. IEEE Trans. Control Syst. Technol. 2002, 8, 118–126.
Hernández, R.; García-Hernández, R.; Jurado, F. Deep Reinforcement Learning for a Mechanical System under Friction Effect. In Proceedings of the 2022 International Conference on Mechatronics, Electronics and Automotive Engineering (ICMEAE); IEEE: Piscataway, NJ, USA, 2022; pp. 29–34.
Al-Mahasneh, A.; Abu Mallouh, M.; Al-Khawaldeh, M.A.; Jouda, B.; Shehata, O.; Baniyounis, M. Online reinforcement learning control of robotic arm in presence of high variation in friction forces. Syst. Sci. Control Eng. 2023, 11, 2251521.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 1.
Nguyen, H.; La, H. Review of deep reinforcement learning for robot manipulation. In Proceedings of the 2019 Third IEEE International Conference on Robotic Computing (IRC); IEEE: Piscataway, NJ, USA, 2019; pp. 590–595.
Koh, S.; Zhou, B.; Fang, H.; Yang, P.; Yang, Z.; Yang, Q.; Guan, L.; Ji, Z. Real-time deep reinforcement learning based vehicle navigation. Appl. Soft Comput. 2020, 96, 106694.
Pérez-Gil, Ó.; Barea, R.; López-Guillén, E.; Bergasa, L.M.; Gomez-Huelamo, C.; Gutiérrez, R.; Diaz-Diaz, A. Deep reinforcement learning based control for Autonomous Vehicles in CARLA. Multimed. Tools Appl. 2022, 81, 3553–3576.
Chen, P.; Pei, J.; Lu, W.; Li, M. A deep reinforcement learning based method for real-time path planning and dynamic obstacle avoidance. Neurocomputing 2022, 497, 64–75.
Wang, L.; Zhang, G.; Yang, Q.; Han, T. An adaptive traffic signal control scheme with Proximal Policy Optimization based on deep reinforcement learning for a single intersection. Eng. Appl. Artif. Intell. 2025, 149, 110440.
Lampaert, V.; Swevers, J.; Al-Bender, F. Comparison of model and non-model based friction compensation techniques in the neighbourhood of pre-sliding friction. In Proceedings of the 2004 American Control Conference; IEEE: Piscataway, NJ, USA, 2004; Volume 2, pp. 1121–1126.
Ruderman, M.; Iwasaki, M. Observer of nonlinear friction dynamics for motion control. IEEE Trans. Ind. Electron. 2015, 62, 5941–5949.
Ruderman, M. Tracking control of motor drives using feedforward friction observer. IEEE Trans. Ind. Electron. 2013, 61, 3727–3735.
Huang, W.S.; Liu, C.W.; Hsu, P.L.; Yeh, S.S. Precision control and compensation of servomotors and machine tools via the disturbance observer. IEEE Trans. Ind. Electron. 2009, 57, 420–429.
Lin, F.J.; Shieh, H.J.; Huang, P.K.; Shieh, P.H. An adaptive recurrent radial basis function network tracking controller for a two-dimensional piezo-positioning stage. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 2008, 55, 183–198.
Lin, F.J.; Shieh, P.H.; Hung, Y.C. An intelligent control for linear ultrasonic motor using interval type-2 fuzzy neural network. IET Electr. Power Appl. 2008, 2, 32–41.
Selmic, R.R.; Lewis, F.L. Neural-network approximation of piecewise continuous functions: Application to friction compensation. IEEE Trans. Neural Netw. 2002, 13, 745–751.
Tan, K.; Lee, T.H.; Zhou, H.X. Micro-positioning of linear-piezoelectric motors based on a learning nonlinear PID controller. IEEE/ASME Trans. Mechatron. 2001, 6, 428–436.
Amthor, A.; Zschack, S.; Ament, C. Position control on nanometer scale based on an adaptive friction compensation scheme. In Proceedings of the 2008 34th Annual Conference of IEEE Industrial Electronics; IEEE: Piscataway, NJ, USA, 2008; pp. 2568–2573.
Lin, C.M.; Li, H.Y. Intelligent control using the wavelet fuzzy CMAC backstepping control system for two-axis linear piezoelectric ceramic motor drive systems. IEEE Trans. Fuzzy Syst. 2013, 22, 791–802.
Olsson, H.; Åström, K.J.; De Wit, C.C.; Gäfvert, M.; Lischinsky, P. Friction models and friction compensation. Eur. J. Control 1998, 4, 176–195.
De Wit, C.C.; Olsson, H.; Astrom, K.J.; Lischinsky, P. A new model for control of systems with friction. IEEE Trans. Autom. Control 1995, 40, 419–425.
Swevers, J.; Al-Bender, F.; Ganseman, C.G.; Projogo, T. An integrated friction model structure with improved presliding behavior for accurate friction compensation. IEEE Trans. Autom. Control 2002, 45, 675–686.
Hsieh, C.; Pan, Y.C. Dynamic behavior and modelling of the pre-sliding static friction. Wear 2000, 242, 1–17.
Al-Bender, F.; Lampaert, V.; Swevers, J. The generalized Maxwell-slip model: A novel model for friction simulation and compensation. IEEE Trans. Autom. Control 2005, 50, 1883–1887.
Kabziński, J.; Jastrzębski, M. Practical implementation of adaptive friction compensation based on partially identified LuGre model. In Proceedings of the 2014 19th International Conference on Methods and Models in Automation and Robotics (MMAR); IEEE: Piscataway, NJ, USA, 2014; pp. 699–704.
Zschäck, S.; Büchner, S.; Amthor, A.; Ament, C. Maxwell Slip based adaptive friction compensation in high precision applications. In Proceedings of the IECON 2012-38th Annual Conference on IEEE Industrial Electronics Society; IEEE: Piscataway, NJ, USA, 2012; pp. 2331–2336.
Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
Nguyen, N.D.; Nguyen, T.; Nahavandi, S. System design perspective for human-level agents using deep reinforcement learning: A survey. IEEE Access 2017, 5, 27091–27102.
Tsitsiklis, J.; Van Roy, B. Analysis of temporal-difference learning with function approximation. IEEE Trans. Autom. Control 1997, 42, 674–690.
Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2016; Volume 30.
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
Popov, I.; Heess, N.; Lillicrap, T.; Hafner, R.; Barth-Maron, G.; Vecerik, M.; Lampe, T.; Tassa, Y.; Erez, T.; Riedmiller, M. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv 2017, arXiv:1704.03073.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning; PMLR: London, UK, 2015; pp. 1889–1897.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning; PMLR: London, UK, 2018; pp. 1861–1870.
Wang, F.; Hu, J.; Qin, Y.; Guo, F.; Jiang, M. Trajectory tracking control based on deep reinforcement learning for a robotic manipulator with an input deadzone. Symmetry 2025, 17, 149.
Xu, H.; Terakawa, T.; Komori, M. Deep-reinforcement-learning-based trajectory tracking control for slidable-wheel omnidirectional mobile robot. J. Adv. Mech. Des. Syst. Manuf. 2025, 19, JAMDSM0031.
Pavlichenko, D.; Behnke, S. Real-robot deep reinforcement learning: Improving trajectory tracking of flexible-joint manipulator with reference correction. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2022; pp. 2671–2677.
Johannink, T.; Bahl, S.; Nair, A.; Luo, J.; Kumar, A.; Loskyll, M.; Ojea, J.A.; Solowjow, E.; Levine, S. Residual reinforcement learning for robot control. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2019; pp. 6023–6029.

Figure 1. The workflow and the relationship of the arresting gear system. (a) Pre-Deployment phase. (b) Initial Deployment phase. (c) Arresting phase. (d) Retrieval phase.

Figure 2. The kinematic and dynamic model of the arresting gear system. Here, A and B denote the two cable payout points.

Figure 3. The mechanical architecture of the arresting gear system. The vertical dash-dot line represents the axis of symmetry of the mechanism.

Figure 4. Hydraulic Drive System of the arresting gear system.

Figure 5. Schematic and force analysis of the arresting cable wrapped around the pulley. (a) Schematic diagram of the arresting cable winding around the pulley and its force analysis. (b) Schematic diagram of the forces acting on an infinitesimal element. In (a), the vertical dash-dot line represents the vertical centerline (y-axis), and the radial dashed lines define the wrap angle θ. In (b), the horizontal dashed line indicates the horizontal reference line for the tension vectors.

Figure 6. Relationship between friction force and velocity in different friction models. (a) Coulomb friction model. (b) Coulomb–viscous friction model. (c) Static–Coulomb–Viscous friction model. (d) Stribeck friction model.

Figure 7. Relationship between the agent and the environment in reinforcement learning.

Figure 8. Schematic Diagram of the Clipped Objective Function in PPO. (a) A > 0; (b) A < 0. The dashed lines indicate the coordinate projections of the clipping thresholds (1 ± ε) and the neutral probability ratio (1) onto the axes.

Figure 9. Architecture of WSR-E-PPO.

Figure 10. Learning curve of the proposed WSR-E-PPO algorithm over 4200 training episodes. The rapid initial ascent and ultimate stable convergence demonstrate the effectiveness of the reward shaping and entropy-regularization mechanisms.

Figure 11. Kinematic response of the UUV controlled by the WSR-E-PPO method. (a) Position versus time, illustrating a direct, monotonic convergence to the target without overshoot (position trajectory). (b) Velocity versus time, demonstrating a smooth, chatter-free deceleration profile for a safe one-pass recovery (velocity trajectory).

Figure 12. Arresting force trajectory regulated by the WSR-E-PPO algorithm, demonstrating smooth and continuous tension modulation without abrupt fluctuations.

Figure 13. Force transmission ratio (arresting cable tension vs. drive output) achieved by the proposed WSR-E-PPO method under severe Stribeck friction coupling.

Figure 14. Temporal evolution of the hydraulic valve control current generated by the WSR-E-PPO policy.

Figure 15. Learning curves of the proposed WSR-E-PPO algorithm across diverse initialization conditions. The solid dark red line represents the mean episodic return averaged over independent training runs with varying random seeds. The shaded region indicates the corresponding $\pm 1$ standard deviation band, illustrating the low run-to-run variance and consistent convergence of the learned control policy.
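
A mean curve with a one-standard-deviation shaded region of this kind is typically computed episode by episode across the seeded runs. The sketch below shows one way to do so with the standard library; the function name is an assumption, the sample standard deviation is an assumed choice (the caption does not say which estimator was used), and the numbers are toy placeholders rather than results from the article.

```python
import statistics


def mean_std_band(runs):
    """Return (mean, lower, upper) curves from equal-length per-seed return curves."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    if len({len(run) for run in runs}) != 1:
        raise ValueError("all runs must have the same number of episodes")
    mean, lower, upper = [], [], []
    for values in zip(*runs):          # one tuple of returns per episode index
        m = statistics.fmean(values)
        s = statistics.stdev(values)   # sample standard deviation (assumed)
        mean.append(m)
        lower.append(m - s)
        upper.append(m + s)
    return mean, lower, upper


# Toy placeholder returns for three seeds; not data from the article.
toy_runs = [
    [-9.0, -6.0, -3.0],
    [-8.0, -5.0, -2.5],
    [-10.0, -7.0, -3.5],
]
mean, lower, upper = mean_std_band(toy_runs)
for episode, (m, lo, hi) in enumerate(zip(mean, lower, upper)):
    print(f"episode {episode}: mean = {m:.2f}, band = [{lo:.2f}, {hi:.2f}]")
```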

Figure 16. Robustness evaluation of the proposed WSR-E-PPO algorithm under three independent perturbation scenarios. The left column (a,c,e) illustrates the displacement trajectories (profiles under initial position variations), while the right column (b,d,f) details the corresponding velocity variations (profiles under compound uncertainties). The results demonstrate the agent’s robust state tracking boundaries and graceful degradation capabilities. The color gradient of the trajectories, transitioning from blue to red, indicates the parameter sweep from its lower bound to its upper bound during the robustness testing.

Figure 17. Performance evaluation of the WSR-E-PPO variant without the Potential-Based Reward Shaping (PBRS) mechanism. (a) The learning curve shows convergence to a suboptimal local minimum with high variance. (b,c) The kinematic trajectories demonstrate a complete mission failure, as the UUV drifts away from the target (0.0 m) with positive velocity. (d) The applied control force collapses to an insufficient level (under 30 N), proving the agent’s inability to overcome the initial friction and resistance threshold without dense shaping rewards. (a) Episodic return curve. (b) Position trajectory. (c) Velocity trajectory. (d) Control force output.

Figure 18. Performance evaluation of the WSR-E-PPO variant without Entropy Regularization. (a) The learning curve exhibits premature convergence to a local optimum around −8000. (b,c) The UUV continues to drift away from the target, failing to complete the recovery. (d) The control force caps at approximately 102.5 N, indicating that the rapid loss of exploration variance prevented the agent from discovering the high-force actions required to overcome the nonlinear friction and hydrodynamic resistance. (a) Episodic return curve. (b) Position trajectory. (c) Velocity trajectory. (d) Control force output.

Figure 19. Performance evaluation of the completely degraded Vanilla PPO variant (without PBRS and Entropy Regularization). (a) The learning curve demonstrates an immediate collapse into a severe local minimum. (b,c) The kinematic responses show a total mission failure, with the UUV drifting uncontrolled away from the target. (d) The control force remains negligible (peaking around 32 N), proving that standard DRL agents fail to explore or optimize effectively under the heavy nonlinear constraints of the UUV arresting environment. (a) Episodic return curve. (b) Position trajectory. (c) Velocity trajectory. (d) Control force output.

Figure 20. Comparison of episodic training returns. The ablation benchmark (Vanilla PPO) and SAC baseline suffer from high variance and premature convergence into local minima. In contrast, the proposed WSR-E-PPO maintains a stable, continuous learning trajectory to successfully solve the heavily constrained task.

Figure 21. Comparison of displacement tracking performance. The proposed WSR-E-PPO algorithm demonstrates rapid, monotonic convergence without steady-state errors or overshoot.

Figure 22. Comparison of velocity regulation and kinematic safety. WSR-E-PPO strictly ensures unidirectional retrieval and smooth deceleration to zero velocity.

Figure 23. Comparison of commanded tension force. The proposed strategy provides a smooth, chattering-free force profile, significantly mitigating actuator wear.

Table 1. Parameters of the reward function.

Parameters	Value
$k_{1}$ (position absolute penalty)	5.0
$k_{2}$ (excess velocity penalty)	100.0
$a_{p}$ (passive acceleration boundary)	0.010 m/s²
$[v_{min}, v_{max}]$ (speed limit bounds)	$[0.006, 0.05]$ m/s
$k_{3}$ (maximum potential energy)	200.0
$γ$ (discount factor)	0.98
$x_{max}$ (starting distance threshold)	2.5 m
$k_{4}$ (terminal success bonus)	150.0
$δ_{x}$ (position tolerance)	0.1 m
$δ_{v}$ (velocity tolerance)	0.01 m/s
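
To show how several of these constants could fit together, the sketch below combines the generic potential-based shaping term, $F(s, s') = γ Φ(s') - Φ(s)$, with a sparse terminal bonus. The linear potential $Φ(x) = k_{3}(1 - |x|/x_{max})$, the inclusive tolerance checks, and all function names are assumptions made for illustration; the table does not specify the article's exact reward expression, and the penalty terms weighted by $k_{1}$ and $k_{2}$ are omitted here.

```python
K3 = 200.0      # maximum potential (Table 1)
GAMMA = 0.98    # discount factor (Table 1)
X_MAX = 2.5     # starting distance threshold in m (Table 1)
K4 = 150.0      # terminal success bonus (Table 1)
DELTA_X = 0.1   # position tolerance in m (Table 1)
DELTA_V = 0.01  # velocity tolerance in m/s (Table 1)


def potential(x: float) -> float:
    """Assumed linear potential: 0 at x_max, K3 at the docking point x = 0."""
    return K3 * (1.0 - min(abs(x), X_MAX) / X_MAX)


def shaping_reward(x: float, x_next: float, gamma: float = GAMMA) -> float:
    """Generic PBRS term: gamma * Phi(s') - Phi(s)."""
    return gamma * potential(x_next) - potential(x)


def terminal_bonus(x: float, v: float) -> float:
    """Sparse bonus once both position and velocity are within tolerance."""
    return K4 if abs(x) <= DELTA_X and abs(v) <= DELTA_V else 0.0


# Moving from 2.5 m to 2.0 m yields a positive shaping signal.
print(shaping_reward(2.5, 2.0))
# Inside both tolerances, the sparse bonus is granted.
print(terminal_bonus(0.05, 0.005))
```

A shaping term of this form gives the agent a dense signal for progress toward the docking point, and potential-based shaping is known to leave the optimal policy of the underlying task unchanged.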

Table 2. Actor–Critic network configuration.

Parameters	Value
State dimension	2
Action dimension	1
Actor hidden layers	2
Actor hidden units	128, 128
Actor activation	Tanh
Critic hidden layers	2
Critic hidden units	128, 64
Critic activation	Tanh
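
As a quick size check on this configuration, the number of trainable weights and biases in each network can be counted directly from the layer widths. The sketch assumes plain fully connected layers with biases, an actor that outputs a single action mean, and a critic that outputs a scalar value; any separately learned standard-deviation parameter would add to the actor total. The helper name is hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# State dimension 2, action dimension 1, hidden widths from Table 2.
actor_layers = [2, 128, 128, 1]
critic_layers = [2, 128, 64, 1]

print("actor parameters: ", mlp_param_count(actor_layers))
print("critic parameters:", mlp_param_count(critic_layers))
```

Under these assumptions both networks are small, on the order of ten thousand parameters each, which is consistent with a two-dimensional state and a one-dimensional action.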

Table 3. Parameter settings for the arresting system.

Category	Parameter	Symbol	Value
UUV Dynamics	Mass of UUV	m	450,000 kg
	Initial velocity	$v_{0}$	0 m/s
	Propulsion force	$F_{prop}$	1450 N
	Target threshold	$x_{threshold}$	2.5 m
Hydraulic Drive	Motor displacement	$D_{m}$	$5.0 \times 10^{- 5}$ m³/rad
	Mechanical efficiency	$η_{m}$	0.92
	Valve flow gain	$K_{q}$	0.05 m³/(s·A)
	Flow-pressure coeff	$K_{c e}$	$3.0 \times 10^{- 12}$ m³/(s·Pa)
	Total control volume	$V_{t}$	$1.0 \times 10^{- 4}$ m³
	Effective bulk modulus	$β_{e}$	$1.4 \times 10^{9}$ Pa
Friction (Stribeck)	Static friction coeff	$μ_{s}$	0.37
	Coulomb friction coeff	$μ_{c}$	0.30
	Stribeck velocity	$v_{s}$	0.1 m/s
	Stribeck shape factor	p	1.0
Cable Geometry	Diameter	d	0.05 m
	Material		Nylon
Controller Setup	Sampling time	$Δ t$	0.1 s
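
To give a feel for the magnitudes implied by the hydraulic drive parameters, the sketch below evaluates the textbook linearized valve flow equation, $Q_{L} = K_{q} i - K_{c e} p_{L}$, and an idealized motor torque, $T = η_{m} D_{m} p_{L}$. Both relations are assumed simplifications and are not taken from the article's model; the function names, the operating current of 0.01 A, and the load pressure of 10 MPa are hypothetical.

```python
K_Q = 0.05       # valve flow gain, m^3/(s*A) (Table 3)
K_CE = 3.0e-12   # flow-pressure coefficient, m^3/(s*Pa) (Table 3)
D_M = 5.0e-5     # motor displacement, m^3/rad (Table 3)
ETA_M = 0.92     # mechanical efficiency (Table 3)


def load_flow(current_a: float, load_pressure_pa: float) -> float:
    """Linearized valve flow in m^3/s (assumed model)."""
    return K_Q * current_a - K_CE * load_pressure_pa


def motor_torque(load_pressure_pa: float) -> float:
    """Idealized motor output torque in N*m (assumed model)."""
    return ETA_M * D_M * load_pressure_pa


# Hypothetical operating point: 0.01 A valve current, 10 MPa load pressure.
current_a = 0.01
load_pressure_pa = 1.0e7
flow = load_flow(current_a, load_pressure_pa)
print(f"load flow    = {flow:.2e} m^3/s")
print(f"motor speed  = {flow / D_M:.2f} rad/s (leakage and compressibility ignored)")
print(f"motor torque = {motor_torque(load_pressure_pa):.1f} N*m")
```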

Table 4. Hyperparameter settings for WSR-E-PPO.

Parameters	Value
Total episodes	4200
Max steps per episode	1000
Batch episodes (Update frequency)	30
Discount factor $γ$	0.98
GAE parameter $λ$	0.90
Actor learning rate	0.0003
Critic learning rate	0.001
Clip parameter $ϵ$	0.2
Entropy coefficient	0.01
Gradient update epochs	10
Optimizer	Adam
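
The discount factor and GAE parameter above enter the advantage estimate through the standard generalized advantage estimation recursion, $δ_{t} = r_{t} + γ V(s_{t+1}) - V(s_{t})$ and $A_{t} = δ_{t} + γ λ A_{t+1}$. The following is a minimal sketch of that recursion for one trajectory; the function name and the toy rewards are assumptions, and details such as advantage normalization in the authors' implementation are not stated in the table.

```python
def gae_advantages(rewards, values, gamma=0.98, lam=0.90):
    """Generalized advantage estimates for a single trajectory.

    `values` needs len(rewards) + 1 entries; the last one is the bootstrap
    value of the final state (0.0 if the episode terminated there).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # backward sum
        advantages[t] = running
    return advantages


# Toy two-step trajectory with zero value estimates (placeholder numbers).
print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0]))
```

With $λ = 0.90$, each residual is down-weighted by $γ λ = 0.882$ per step of delay, trading some bias for lower variance relative to the full Monte Carlo return.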

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
