1. Introduction
In the mining industry, the tracking control of autonomous vehicles presents significant challenges in off-road environments, where complex terra-mechanical interactions can lead to performance degradation. Autonomous driving can be affected by disturbances arising from the unpredictability of the operating environment, including uneven terrain, unpredictable frictional characteristics, and external factors such as airborne particulate matter. Specifically, the mining environment is characterized by complex topographical features of the soil, which reduce its traction properties, further complicating autonomous vehicle navigation [
1]. These aspects not only impact the vehicle’s ability to maintain stable motion control but also introduce safety risks to both autonomous systems and human operators.
Terrain complexity exacerbates tire stress, particularly through the presence of turns and slopes that induce uneven load distribution, alongside the critical phenomena resulting from wheel–terrain interaction due to debris accumulation on the surface [
2]. In particular, friction loss under these conditions leads to weakened traction forces, which can result in drifting and destabilization, thereby impairing the control of the vehicle [
3]. These challenges can also affect the lifespan of tires [
4], necessitating frequent replacements and reducing the availability of the hauling fleet, which in turn impacts overall productivity. Due to the complex wheel–terrain interaction experienced by mining mobile vehicles, there is a critical need to enhance the robustness of motion control systems to counteract terrain disturbances and mitigate the adverse effects of slip phenomena on traction and vehicle dynamics.
Traditional control methods, such as Model Predictive Control (MPC) [
5,
6] and Sliding Mode Control (SMC) [
7,
8], have been used to address such challenges in autonomous vehicle motion control and related dynamics. However, these approaches are inherently dependent on dynamic models of both the robot and its interaction with the environment. This reliance makes them vulnerable to the performance degradation caused by modeling mismatches, which may arise from inaccurate model assumptions or discrepancies between the assumed and actual vehicle dynamics. Such modeling errors can lead to suboptimal control performance, especially in highly dynamic and uncertain environments, such as those found in mining operations. Alternatively, Deep Reinforcement Learning (DRL) techniques offer a promising approach by learning control policies through continuous interaction with the environment, using real-time feedback to adapt to dynamic conditions [
9,
10]. This capability reduces reliance on an exact model, making DRL particularly advantageous in this study, as it enables the controller to operate based on the robot’s real-time dynamics rather than relying on precise and often complex terra-mechanical models. However, limitations in robot sensors during real-time data acquisition often result in partial observability, where the controller is unable to obtain information from a complete representation of the environment, potentially diminishing its effectiveness [
11]. Although DRL-based navigation strategies have shown significant progress in mobile robots, current methods often face limitations in unstructured environments where simplified control structures are not adapted to varying motion conditions or terrain assumptions [
12], highlighting the need for robust and adaptive control schemes capable of learning suitable dynamics under varying system conditions. The use of Recurrent Neural Networks (RNNs) has been studied to address this problem [
11,
13], leveraging historic observations to infer the current environment state. However, to the best of our knowledge, this integration has not been assessed on recent state-of-the-art DRL algorithms.
Work Contributions
The main contributions of this work are as follows:
Integration of Long Short-Term Memory (LSTM) networks into DRL-based controllers, enabling control agents to recurrently retain and use temporal dependencies under partial observability of past dynamics. This enhancement improves the agents’ ability to infer unobservable system states caused by sensor limitations and complex terrain interactions, contributing to the generalization of learned policies and greater adaptability under challenging terrain conditions.
A comprehensive evaluation of vanilla policy gradient-based DRL algorithms that rely on fully observable state representations, namely, PPO, DDPG, TD3, and SAC, for trajectory tracking control in skid-steer mobile robots operating under mining terra-mechanical constraints. This study analyzes the stability, adaptability, and effectiveness of the proposed algorithms while ensuring accurate trajectory tracking and regulation despite terrain disturbances and dynamic parameter variations.
The application of DRL algorithms for autonomous motion control, focusing on tracking reference trajectories in numerical simulations and real-world mining environments subject to external disturbances and uncertainties. The findings provide insights into the role of memory-augmented reinforcement learning for robust motion control in unstructured and high-uncertainty environments, advancing the performance of DRL approaches and their applicability to mobile robots in complex mining operations.
In summary, this work aims to develop a robust motion control strategy for the trajectory tracking of autonomous skid-steer robots operating in mining environments by integrating LSTM networks into policy gradient-based DRL methods. The proposed approach addresses the challenges of partial observability arising from vehicle dynamics and terrain-induced slip due to wheel–terrain interaction, enabling adaptive control without dependence on accurate terra-mechanical models.
The remainder of the paper is organized as follows:
Section 2 provides a review of related works on reinforcement learning (RL) techniques applied to trajectory tracking control in autonomous motion.
Section 3 outlines the theoretical background of Deep Reinforcement Learning (DRL), laying the foundation for the proposed methodologies. In
Section 4, the DRL techniques used as the basis for the proposed motion controllers are detailed.
Section 5 presents the vehicle dynamic model and terrain characterization used for training and evaluating the motion controllers.
Section 6 covers the control design and parameter settings implemented in the trajectory tracking controllers.
Section 7 describes the experimental setup and analyzes the results obtained from the experiments. Finally, the paper ends in
Section 8 with concluding remarks on this work.
2. Related Works
Trajectory tracking remains a fundamental challenge in autonomous motion control, with traditional control methods often struggling to address complexities inherent to real-world environments. For instance, approaches such as Model Predictive Control (MPC) [
5], Fuzzy Logic-based control [
14], and Sliding Mode Control [
7] have been extensively applied to trajectory tracking in autonomous motion tasks. However, even when designed to be adaptive, these methods have often relied on accurate models of the robot motion dynamics, which can be challenging or even infeasible to obtain in environments subject to uncertainties [
15]. In contrast, reinforcement learning (RL) methods currently offer a compelling alternative by learning optimal control policies without necessarily requiring an explicit model formulation of the robot system dynamics. Instead, RL enables an agent to learn motion dynamics through direct interaction with the environment, iteratively refining its policy to maximize reward signals that guide the system toward prescribed trajectory tracking control and regulation objectives. This characteristic makes RL well suited for motion control in complex and uncertain environments, and it has been successfully applied to a wide variety of applications in structured and non-structured environments, including autonomous navigation of aerial [
16,
17,
18] and marine vehicles [
19], industrial process control [
20], and even complex gameplay scenarios [
21]. For instance, the work in [
22] addressed a path tracking problem in autonomous ground vehicles using inverse RL under matched uncertainties. However, the approach assumed idealized system conditions and did not account for external disturbances or environmental variability, which limited its applicability in real-world scenarios. In [
23], a Q-Learning (QL) strategy was used in combination with a PID controller for obstacle avoidance in trajectory tracking tasks for simulated wheeled mobile robots. However, as Q-Learning relies on a tabular estimation of the state–action value function, it does not scale well to the continuous state and action spaces that are often necessary for the effective trajectory tracking control of autonomous vehicles [
21].
To address the scalability limitations of Q-Learning, Deep Q-Networks (DQNs) may be used instead, replacing tabular Q-value estimation with neural networks and enabling the handling of high-dimensional continuous state spaces [
21]. The Double Deep Q-Network (DDQN) has been adopted as a variant of the DQN method to enhance stability and reduce overestimation biases in trajectory tracking controllers [
24]. However, both DQN and DDQN inherently rely on discrete action spaces, as they approximate an optimal action value function by selecting actions from a finite set, limiting their ability to execute accurate and smooth control actions. Policy gradient methods address this limitation by directly optimizing a parameterized policy, enabling control in continuous action spaces. Offline learning techniques such as Conservative QL (CQL) [
25] or Bayesian QL (BQL) [
26] were not considered here since our objective focused on interaction-driven policy learning with real-time adaptability rather than leveraging a pre-collected or static dataset obtained from the robot dynamics. Similarly, while model-based RL approaches such as MBPO (Model-Based Policy Optimization) [
27] and PETS (Probabilistic Ensembles with Trajectory Sampling) [
28] have demonstrated data efficiency, they were not used here due to our focus being on model-free methods that avoid potential model bias and complexity in highly uncertain and partially observable environments.
A persistent challenge in autonomous driving arises when sensor limitations prevent the agent from fully observing the system state at any given time [
11,
13]. This issue is especially critical in mining and off-road environments, where observations from range sensors or positioning systems may be intermittent or unreliable due to environmental constraints such as occlusions, dust, or irregular terrain [
29]. Moreover, complex terrain topography and variable soil conditions can constrain traversability and introduce uncertainties in planning and control, resulting from uneven friction conditions and dynamic weather effects on the soil’s physical properties [
16]. For instance, in [
30], RL methods based on convolutional neural networks and value iteration networks were used to address maneuverability tasks. However, the approach was limited to high-level planning under known terrain maps, with reward structures learned in a fully observable setting. Consequently, it did not consider partial observability introduced by sensor limitations or dynamic terrain changes, nor did it incorporate low-level control disturbances arising from wheel–terrain interaction.
Deep RL techniques have shown improved robust motion control and navigability on rough terrain [
31,
32]. Among policy gradient methods, Proximal Policy Optimization (PPO) has been applied to motion control tasks [
13], where it has been integrated with Long Short-Term Memory (LSTM) networks in distributed architectures to address partial observability. LSTM networks have been used to process historical state information, which was exploited exclusively during random initializations of reference states in each learning episode. Such an approach enhanced state inference and early exploration, helping to mitigate the effects of unknown initial conditions on vehicle dynamics [
11,
13].
While PPO demonstrated effectiveness in continuous control, the on-policy nature of PPO limits its sample efficiency compared with off-policy approaches, making it less practical in scenarios with restricted data availability. Such concern has encouraged the development of Deep Reinforcement Learning (DRL) techniques that integrate the advantages of policy gradient methods, such as their capability to handle continuous action spaces, with the sample efficiency of off-policy algorithms such as Q-Learning. For instance, Deep Deterministic Policy Gradient (DDPG) strategies optimize deterministic policies by using the deterministic policy gradient approach [
33]. This method uses an off-policy learning paradigm with an actor–critic architecture, enabling more efficient data reuse and improved convergence properties. In [
10,
34], DDPG-based trajectory tracking controllers were designed to enhance motion control in autonomous mobile robots with ground force interactions. Although extensive simulation tests demonstrated the effectiveness of DDPG in handling terrain disturbances, robust control performance was primarily assessed in the presence of external disturbances, while the system’s ability to adapt to dynamic and complex terrain conditions such as slip dynamics or terrain variations was not comprehensively evaluated. For instance, in [
35], the authors developed a one-step RL strategy based on DDPG for simultaneous motion planning and tracking control in a simulated robot under ideal terrain conditions, which restricted the agent’s ability to model temporal dependencies and perform long-horizon multi-step learning. This limitation could reduce control performance in robot tasks that require sequence learning and sustained adaptation over time. Similarly, the work in [
11] proposed an extension of DDPG by integrating Long Short-Term Memory (LSTM) layers into the neural network architecture, improving control performance in partially observable environments. This approach uses a kinematic controller as a reference for two independent PI dynamic controllers under ideal motion conditions without traction losses, taking advantage of the LSTM’s ability to retain historical state information and explore control actions. Experimental results demonstrated that the hybrid DDPG-LSTM framework outperformed both conventional PI controllers and standard DDPG under dynamic uncertainty. However, its robustness across diverse trajectory configurations, particularly in scenarios involving sharp turns and terrain disturbances, was not assessed. To address some of DDPG’s limitations, such as value overestimation and training instability, the Twin Delayed DDPG (TD3) algorithm was developed [
36]. In particular, the TD3 method incorporates mechanisms to stabilize learning, such as delayed target network updates and action smoothing [
37]. This algorithm has been successfully applied to trajectory tracking for racing vehicles [
38], particularly for drift control in high-speed racing scenarios. Alongside TD3, the Soft Actor–Critic (SAC) algorithm was introduced in [
39]. SAC trains a stochastic policy in an off-policy manner by optimizing entropy as part of the learning objective. In such work, SAC has been shown to outperform DDPG and DQN in trajectory tracking for racing applications, particularly in drift control, by showcasing its adaptability to unseen environments [
40]. Hybrid methods combining DDPG with deterministic algorithms [
41] or modified SAC variants with heuristic integration [
42] have shown improvements in trajectory smoothness and training efficiency under static or mildly dynamic conditions. Similarly, approaches such as HDRS-based DRL [
43] and improved TD3 algorithms leveraging transfer learning and prioritized experience replay [
44] focus primarily on reducing convergence time and optimizing path planning and control performance in structured environments. However, these methods largely operate under fully observable state assumptions and use feedforward architectures, limiting their ability to model temporally extended dependencies required for robust decision making in scenarios with incomplete state information and complex dynamic interactions. As an alternative to the previously discussed model-free RL algorithms, model-based RL techniques greatly improve sampling efficiency by learning a representation of the model dynamics. For instance, Model-Based Policy Optimization (MBPO) was applied to trajectory tracking in autonomous driving, outperforming SAC while requiring an order of magnitude fewer training episodes [
45]. Nevertheless, their reliance on a learned dynamic model may make them unsuitable for mining environments due to the high degree of uncertainty. In particular, unpredictable terrain phenomena such as variable slip conditions, as well as operational hazards such as airborne materials, can degrade sensor reliability, introducing uncertainty into state estimation. These challenges could introduce significant modeling errors, which may lead the agent to exploit inaccuracies in the dynamic model and learn a suboptimal control policy. Moreover, while these RL-based methods have shown promise across various trajectory tracking tasks, selecting the most suitable algorithm for a given problem remains complex. Several factors, such as the agent’s observation space, the tuning of hyperparameters, and the design of the reward function, significantly impact the effectiveness of DRL-based controllers [
46]. Therefore, this work also proposes a comparative study of DRL techniques, specifically for trajectory tracking in mobile robots subject to terra-mechanical constraints, to provide deeper insights into their performance and applicability.
3. Theoretical Background
In this section, the vehicle dynamics for trajectory tracking are formulated within the framework of a Markov Decision Process (MDP). Under this formulation, an agent selects an action $a_t$ from the action space $\mathcal{A}$ based on a policy $\pi$, which maps observed system states $s_t \in \mathcal{S}$ to optimal control decisions. The selected action transitions the system to a new state within the state space $\mathcal{S}$ and yields a reward $r_t$ from the reward set $\mathcal{R}$, which quantifies the quality of the chosen action in achieving the trajectory tracking objective. The goal of the proposed RL algorithms is to learn an optimal policy $\pi^{*}$ so that the cumulative discounted reward $G_t$ is maximized, defined as
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},$$
where $r_{t+k}$ represents the immediate reward at each time step, and $\gamma \in [0,1)$ is the discount factor that determines the relative balance of immediate and future rewards. By optimizing $\pi$, the agent learns to make decisions that not only maximize immediate rewards but also improve long-term performance. This optimization process requires balancing the trade-off between exploration, where the agent gathers new information about the environment, and exploitation, where it applies acquired knowledge to maximize rewards. Such a balance is crucial in complex and dynamic environments, where adaptability and continuous learning enable the agent to achieve robust control performance and efficient trajectory tracking.
To quantify the expected cumulative reward associated with specific actions and states, the action–state value function $Q^{\pi}(s,a)$, commonly referred to as the Q-function, is used. This function evaluates the long-term reward obtained by taking action $a$ in state $s$ and subsequently following policy $\pi$. It is formally defined as follows:
$$Q^{\pi}(s_t,a_t) = r_t + \gamma \sum_{s_{t+1} \in \mathcal{S}} P(s_{t+1} \mid s_t, a_t) \sum_{a_{t+1} \in \mathcal{A}} \pi(a_{t+1} \mid s_{t+1})\, Q^{\pi}(s_{t+1}, a_{t+1}),$$
where $a_t$ represents the control action taken at time step $t$, $s_t$ denotes the environment state at time step $t$, $P(s_{t+1} \mid s_t, a_t)$ is the state transition probability distribution, $\gamma$ is the discount factor, and $\pi(a_t \mid s_t)$ represents the probability of selecting action $a_t$ under policy $\pi$ given state $s_t$ [47].
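For illustration only, the following minimal sketch (Python/NumPy, with arbitrary reward values that are not taken from the experiments) computes the discounted return defined above and the one-step bootstrapped target commonly used by value-based critics such as those in DDPG, TD3, and SAC:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward G_t = sum_k gamma^k * r_{t+k}, evaluated at t = 0."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, q_next, gamma=0.99, done=False):
    """One-step bootstrapped target used by value-based critics: r + gamma * Q(s', a')."""
    return reward + (0.0 if done else gamma * q_next)

# Example: a three-step episode whose rewards favor small tracking errors.
print(discounted_return([1.0, 0.5, 0.2], gamma=0.9))   # 1.0 + 0.9*0.5 + 0.81*0.2 = 1.612
```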
5. Environment Characterization: Skid-Steer Mobile Robot
A skid-steer mobile robot (SSMR) is used in this work as part of the environment to assess the proposed DRL approaches for trajectory tracking control. The robot incorporates a differential steering mechanism combined with a pair-driven four-wheel system to distribute the effects of lateral skidding and longitudinal slip experienced by each wheel, as illustrated in
Figure 3. The SSMR model assumes a uniform load distribution across the vehicle to simplify the internal force balance. However, the SSMR is also prone to external forces induced by terrain unevenness, which may disrupt this balance and introduce additional disturbances. As shown in
Figure 3, the SSMR model highlights key force interactions, including longitudinal traction forces
and lateral resistance forces
for each wheel. Unlike ideal unicycle models, this approach adopts the SSMR model presented in [
5,
52] as the environment for training and testing the proposed DRL algorithms. This model offers a more realistic representation of wheel–ground interactions, especially off-road or on uneven terrains commonly found in mining environments, where traction varies dynamically. To account for the nonholonomic restrictions inherent in skid-steer systems, the model explicitly incorporates lateral motion constraints, which are typically neglected in simplified models. This ensures a more accurate representation of the SSMR dynamics, particularly when navigating surfaces with different friction characteristics or when executing turns that may induce lateral skidding. However, model uncertainties still persist in the dynamics associated with the robot model due to the heterogeneous conditions of the navigation terrain. The motion model for the SSMR can be described as follows:
$$\dot{\mathbf{q}} = f\left(\mathbf{q}, \boldsymbol{\eta}; \boldsymbol{\Theta}\right) + \boldsymbol{\delta}, \qquad \mathbf{q} = \left[\, x \;\; y \;\; \theta \;\; v \;\; \omega \,\right]^{\top}, \qquad \boldsymbol{\eta} = \left[\, \eta_{v} \;\; \eta_{\omega} \,\right]^{\top},$$
where $x$, $y$, and $\theta$ denote the robot’s pose, and $v$ and $\omega$ are its linear and angular speeds, respectively. Slippage and disturbances in skidding forces are accounted for in the vector $\boldsymbol{\delta}$, which characterizes disturbances affecting the robot’s motion, including position, orientation, and speed errors. The vector of model parameters $\boldsymbol{\Theta}$ is derived from the characteristics of the robot chassis, such as mass, inertia, and geometric specifications, obtained from [5]. Lastly, $\eta_{v}$ and $\eta_{\omega}$ represent the control actions applied to the SSMR, namely, the traction and turning speeds, respectively. The reader is referred to [5] for further details of the SSMR model.
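As an illustration of how such a model can serve as a simulation environment step, a minimal sketch is given below. The kinematic structure and the first-order speed dynamics (with time constants tau_v and tau_w) are simplifying assumptions introduced here for clarity and do not reproduce the full terra-mechanical model of [5]:

```python
import numpy as np

def ssmr_step(q, eta, delta, dt=0.1, tau_v=0.35, tau_w=0.25):
    """One Euler integration step of a simplified SSMR motion model.

    q     : state [x, y, theta, v, omega]
    eta   : control actions [traction speed reference, turning speed reference]
    delta : additive disturbance vector modelling slip/skid effects
    tau_v, tau_w : illustrative first-order speed-response constants (assumptions, not from [5])
    """
    x, y, th, v, w = q
    dq = np.array([
        v * np.cos(th),          # planar kinematics; slip enters additively via delta
        v * np.sin(th),
        w,
        (eta[0] - v) / tau_v,    # first-order traction-speed response
        (eta[1] - w) / tau_w,    # first-order turning-speed response
    ])
    return q + dt * (dq + delta)

# Example: one 0.1 s step from rest with a 0.6 m/s traction command and no disturbance.
q = ssmr_step(np.zeros(5), eta=np.array([0.6, 0.1]), delta=np.zeros(5))
```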
7. Experimental Setup and Results
To evaluate the performance of the proposed trajectory tracking controllers, both simulation and experimental trials were conducted by using the SSMR described in
Section 5. The controllers were synthesized by using MATLAB R2024b (MathWorks
®, Natick, MA, USA), leveraging the Reinforcement Learning Toolbox and the Deep Learning Toolbox. The training of the DRL agents was executed on a laptop equipped with an AMD Ryzen 7 4800U processor (AMD
®, Santa Clara, CA, USA), 16 GB of RAM, and an NVIDIA GeForce RTX 3080 Ti graphics card (Nvidia
®, Santa Clara, CA, USA) to accelerate computations. Parallel processing was used through the Parallel Computing Toolbox to optimize efficiency by distributing tasks across multiple cores. While the implementation of the DRL algorithms was performed in MATLAB, the training, testing, and validation of the simulated agents were carried out within the Simulink environment (V24.2). The simulation setup featured a custom-built robotic environment in Simulink, which provides real-time feedback on position, orientation, and linear and angular speeds, with data sampled every 0.1 s.
Figure 5 presents the implementation diagram of an agent and the simulation environment.
For experimental validation, field trials were also conducted in a realistic mining environment. Specifically, a geographical model of an underground industrial mine was selected as a testbed [
54]. The dataset was collected from a real-world mining environment within an active pit division, including a complex underground mine with over 4500 km of drifts, located in Chile’s Libertador General Bernardo O’Higgins Region [
55]. The navigation terrain featured irregular surfaces, debris, and surface cavities partially covered by loose soil, generating conditions prone to slip phenomena due to the interaction with vehicle dynamics. Point-cloud maps of narrow tunnels and uneven surfaces with heterogeneous terrain conditions were first acquired by using 3D LiDAR (Riegl®, Horn, Austria), 2D radar (Acumine®, Eveligh, Australia), and stereo cameras (Point Grey Research®, Vancouver, BC, Canada) to define the test environment. The 3D LiDAR scans were manually aligned to estimate the robot pose throughout the traversed tunnels, obtaining a complete projection of the environment, from which a top-view 2D navigation map was extracted [
54]. This top-view map was utilized as a testing environment, and a reference trajectory was generated by connecting pre-planned waypoints that fit on the navigation surface, ensuring alignment with spatial constraints. Only 2D point-cloud data were used to maintain a compact and control-relevant state representation suitable for low-level trajectory tracking under terra-mechanical constraints. The use of full 3D perception was intentionally avoided to reduce the state space dimensionality, minimize the computational burden, and ensure the feasibility of real-time policy deployment. This setup enabled a comprehensive evaluation of the proposed trajectory tracking controllers under realistic mining conditions.
Figure 6 illustrates the tunnel section where the point-cloud dataset was collected, highlighting the segmented navigation map used to assess the performance of the proposed controllers under test.
7.1. Reference Trajectory Setup
To train the DRL-based controllers, two reference trajectories were considered: (i) a lemniscate-shaped trajectory and (ii) a square-shaped trajectory. First, the lemniscate-shaped trajectory was chosen for its ability to provide continuous and smooth turns, with varying angular and linear speeds throughout its path. This trajectory exposes the training agent to speed references that change dynamically over time, providing a wide range of exploration conditions and enabling the agent to learn diverse turning and traction motion profiles. The lemniscate shape is relevant for mining environments as it resembles the curved and narrow tunnels common in underground mines. The continuous nature of the trajectory allows the agent to practice adapting to gradual changes in direction and speed, similar to the maneuvers required in mining tunnels, which often have irregular geometries and non-linear paths. The lemniscate-shaped trajectory is parameterized by a scaling factor $a$ that defines the reference trajectory’s size and an orientation angle $\theta_r$ for the SSMR, which varies according to a time vector $t$ and frequency $\omega_f$. Additionally, to evaluate the performance of the DRL agents under sharp-turning conditions, a square-shaped trajectory was also employed. The square trajectory was parameterized at a constant speed of 0.6 m/s over a fixed side length, leading to a total simulation time of 333 s. This trajectory, characterized by abrupt directional changes at its corners, challenges the agent’s ability to respond to sudden orientation shifts, contrasting with and complementing the smoother transitions of the lemniscate trajectory.
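For reference, the sketch below generates both trajectories under stated assumptions: the lemniscate is an assumed Gerono-type parameterization with illustrative values for $a$ and $\omega_f$, and the square assumes a 50 m side, which would be consistent with a single lap at 0.6 m/s over roughly 333 s:

```python
import numpy as np

def lemniscate(t, a=5.0, w_f=2 * np.pi / 160):
    """Assumed Gerono-type lemniscate reference; a and w_f are illustrative placeholders."""
    x_r = a * np.sin(w_f * t)
    y_r = a * np.sin(w_f * t) * np.cos(w_f * t)
    # Reference heading taken from the velocity direction along the path.
    dx = a * w_f * np.cos(w_f * t)
    dy = a * w_f * np.cos(2 * w_f * t)
    th_r = np.arctan2(dy, dx)
    return x_r, y_r, th_r

def square(side=50.0, speed=0.6, dt=0.1):
    """Square reference sampled at constant speed (assumed single lap: 4*50 m / 0.6 m/s ~ 333 s)."""
    corners = np.array([[0, 0], [side, 0], [side, side], [0, side], [0, 0]], float)
    pts = []
    for p0, p1 in zip(corners[:-1], corners[1:]):
        n = int(np.linalg.norm(p1 - p0) / (speed * dt))
        pts.append(p0 + (p1 - p0) * np.linspace(0, 1, n, endpoint=False)[:, None])
    return np.vstack(pts)
```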
7.2. Hyperparameter Configuration and Tuning
Commonly used hyperparameter configurations from related learning architectures for the control of motion dynamics were first selected [
38,
40]. Subsequently, a broader set of hyperparameters was initialized by using heuristic methods [
10,
11,
13,
34] during the exploration phase, ensuring functionality, bias reduction, and stable consistency of the cumulative rewards. Various parameter values were systematically evaluated to enhance training efficiency while maintaining stability in the learning process. Then, building upon these predefined values, the Whale Optimization Algorithm (WOA) [
56] was applied to fine-tune the hyperparameters. This algorithm, inspired by the bubble-net feeding behavior of humpback whales, optimizes hyperparameters by iteratively refining candidate solutions based on cumulative training rewards [
56]. In this context, each DRL agent is represented as a whale, the optimization objective corresponds to maximizing cumulative rewards, and the agent’s position is defined by the hyperparameter set. Following the approach in [
57], eight PPO agents were trained under the same hardware conditions for ten iterations, resulting in a set of hyperparameters, shared with DDPG, whose values were similar to those obtained with the previously described heuristic method. This led to the selection of the final hyperparameter configurations for all the proposed DRL algorithms, which are summarized in
Table 3.
Table 3 also lists the number of episodes each algorithm required to approach convergence, as well as the maximum number of training episodes after which training was stopped. Based on the analysis of the slope of the mean cumulative reward, the TD3 and SAC agents exhibited the fastest policy convergence, reaching convergence in 806 and 1146 episodes, respectively, which was expected due to their improved sample efficiency and stabilization characteristics. It is worth noting that, due to computational constraints and the extensive training times required for hyperparameter tuning, the values used in this study may not represent globally optimal solutions but were selected based on the best trajectory tracking control performance within the given motion constraints.
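A minimal sketch of the WOA search loop is shown below for clarity. Here, evaluate_agent is a hypothetical callback that trains a candidate agent and returns its mean cumulative reward, and the coefficient schedule follows the standard formulation of [56]; this is an illustrative sketch rather than the exact tuning pipeline used in this work:

```python
import numpy as np

def woa_tune(evaluate_agent, low, high, n_whales=8, n_iters=10, seed=0):
    """Minimal Whale Optimization Algorithm sketch for hyperparameter search.

    evaluate_agent(x) -> mean cumulative training reward for hyperparameter vector x
    low, high         -> arrays bounding each hyperparameter dimension
    """
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, float), np.asarray(high, float)
    X = rng.uniform(low, high, size=(n_whales, low.size))       # whale positions = candidate sets
    fit = np.array([evaluate_agent(x) for x in X])
    best, best_fit = X[np.argmax(fit)].copy(), fit.max()

    for it in range(n_iters):
        a = 2.0 * (1.0 - it / n_iters)                           # linearly decreasing coefficient
        for i in range(n_whales):
            r1, r2 = rng.random(low.size), rng.random(low.size)
            A, C = 2.0 * a * r1 - a, 2.0 * r2
            if rng.random() < 0.5:
                if np.all(np.abs(A) < 1.0):                      # exploit: encircle the best whale
                    X[i] = best - A * np.abs(C * best - X[i])
                else:                                            # explore: move relative to a random whale
                    ref = X[rng.integers(n_whales)]
                    X[i] = ref - A * np.abs(C * ref - X[i])
            else:                                                # bubble-net spiral around the best whale
                l = rng.uniform(-1.0, 1.0)
                X[i] = np.abs(best - X[i]) * np.exp(l) * np.cos(2.0 * np.pi * l) + best
            X[i] = np.clip(X[i], low, high)
        fit = np.array([evaluate_agent(x) for x in X])
        if fit.max() > best_fit:
            best, best_fit = X[np.argmax(fit)].copy(), fit.max()
    return best, best_fit
```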
7.3. Cumulative Reward Analysis
The DRL agents were first trained and evaluated on a lemniscate-shaped trajectory, designed to expose the controllers to a wide range of maneuverability conditions. This trajectory includes continuous turns with both positive and negative angular speeds, as well as straight segments with variable linear speeds, ensuring that the agents learn to adapt to different motion dynamics. The training phase focused exclusively on reference trajectory tracking, without external disturbances, in order to later isolate their effects on the learned control policy. To reduce over-fitting and enhance generalization, reference trajectory randomization was applied at the beginning of each training episode. This was achieved by randomly scaling and translating the lemniscate trajectory by up to 25% of the predefined trajectory’s parameters while maintaining kinematic constraints. Additionally, initial condition randomization was implemented to improve the controller’s ability to realign with the reference trajectory. Specifically, an incremental value was added to the initial orientation of the SSMR at the start of each episode, with a randomly assigned sign to alter the direction of the offset. This value was progressively increased, by a small average amount in radians per episode, to prevent learning instability, up to a maximum deviation over the training procedure.
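The per-episode randomization can be sketched as follows. The heading increment and its upper bound are illustrative placeholders rather than the values used in training, and the scaling/translation follows the 25% range described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_reference(x_ref, y_ref, scale_range=0.25):
    """Randomly scale and translate the nominal trajectory by up to 25% of its extent."""
    s = 1.0 + rng.uniform(-scale_range, scale_range)
    span = max(np.ptp(x_ref), np.ptp(y_ref))
    tx, ty = rng.uniform(-scale_range, scale_range, size=2) * span
    return s * x_ref + tx, s * y_ref + ty

def randomize_initial_heading(theta0, episode, step=0.01, max_offset=0.5):
    """Progressively larger heading offset with a random sign; step and max_offset are placeholders."""
    offset = min(step * episode, max_offset)
    return theta0 + rng.choice([-1.0, 1.0]) * offset
```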
Figure 7 depicts a comparison of the cumulative rewards obtained after training DDPG, PPO, TD3, SAC, and their corresponding LSTM-augmented models. The PPO agent achieved a convergence rate comparable to those of TD3 and SAC, which is explained by the greater exploration degree of its stochastic policy and the reduced dimensionality of the observation space, which simplifies the search for the optimal policy. Conversely, DDPG exhibits the slowest mean reward growth compared with TD3, PPO, and SAC, which may be caused by the overestimation of the
Q-function, leading to possible suboptimal local convergence. The LSTM-based approaches required more training episodes to converge on a policy compared with their base models (i.e., DDPG, PPO, TD3, and SAC). Specifically, PPO-LSTM and SAC-LSTM exhibit a lower final mean reward than PPO and SAC, respectively. For PPO-LSTM, the slower convergence could be attributed to the increased dimensionality introduced by historical sequence information, which expands the search space and thus reduces the sample efficiency. In SAC-LSTM, the performance drop may be explained by the increased complexity of entropy temperature optimization, leading to noisier control actions. Similarly, the DDPG-LSTM and TD3-LSTM approaches present comparable mean rewards to their respective base models but at a slower learning rate. The latter is expected, as the recurrent layer introduces additional parameters that require extended training to extract meaningful temporal dependencies from the historical observation sequence.
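To make the recurrent policy structure concrete, the sketch below shows an LSTM layer inserted before the actor head so that actions are conditioned on a window of past observations. The layer sizes, window length, and action bounds are illustrative assumptions and do not correspond to the architectures detailed in Section 6:

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Illustrative LSTM-based actor: maps an observation history to bounded control actions."""

    def __init__(self, obs_dim, act_dim, hidden=128, act_limit=1.0):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, act_dim), nn.Tanh())
        self.act_limit = act_limit

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, sequence_length, obs_dim) of past tracking errors and speeds
        out, hidden_state = self.lstm(obs_seq, hidden_state)
        action = self.act_limit * self.head(out[:, -1])   # act on the most recent hidden state
        return action, hidden_state

# Example: a 10-step history of 6-dimensional observations producing traction/turning commands.
actor = RecurrentActor(obs_dim=6, act_dim=2)
action, hidden = actor(torch.zeros(1, 10, 6))
```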
7.4. Tracking Test in Lemniscate-Type Trajectories
This test involved tracking a lemniscate reference trajectory to evaluate the control performance of the proposed agents using DDPG, PPO, TD3, SAC, and their respective LSTM-based approaches. The simulation results of this test are presented in
Figure 8, and the performance metrics in terms of IAE, ISE, ITAE, ITSE, TVu, and cumulative control effort [
58] are shown in
Table 4. As illustrated in
Figure 8, all DRL approaches presented similar trajectory tracking responses, except for PPO-LSTM and SAC-LSTM, for which the orientation and the lateral and longitudinal tracking errors eventually diverged, as depicted in
Figure 8c–f. Similarly, in
Figure 8c,e, the lateral and longitudinal tracking errors were close to zero for most approaches, including DDPG, DDPG-LSTM, PPO, PPO-LSTM, TD3, and TD3-LSTM. However, among these, PPO was the only agent that achieved lower trajectory tracking errors compared with its LSTM-enhanced counterpart (i.e., PPO-LSTM), which can be attributed to its increased stochasticity relative to the other evaluated approaches. This result also suggests that the PPO agent prioritized minimizing orientation and linear speed errors over positional error, potentially favoring stability in trajectory tracking at the expense of accurate robot position. On the other hand, DDPG-LSTM achieved the lowest positional error across all agents, with reductions of 21% in the IAE and 10% in the ISE compared with TD3, which had the lowest positional error metrics among the base models. This result can be attributed to the ability of LSTM-enhanced DDPG to retain past trajectory information, improving position tracking accuracy. In contrast, TD3-LSTM and SAC-LSTM provided only marginal improvements in tracking performance compared with their respective base models. Their IAE metrics showed minor differences, with TD3-LSTM improving by only 1.38% relative to TD3 and SAC-LSTM increasing by 5.6% compared with SAC. Additionally, TD3-LSTM exhibited a 4.5% reduction in ISE relative to TD3, whereas SAC-LSTM had a slight 2.2% decrease in ISE compared with SAC.
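For completeness, the error and control-effort indices reported in Table 4 can be approximated from the sampled error and control signals as in the sketch below; these are discrete-time approximations, with the precise definitions being those of [58]:

```python
import numpy as np

def tracking_metrics(t, e, u):
    """Discrete-time approximations of the performance indices used in the tables.

    t : time stamps [s], e : tracking error signal, u : control action signal
    """
    dt = np.diff(t, prepend=t[0])
    iae  = np.sum(np.abs(e) * dt)           # Integral of Absolute Error
    ise  = np.sum(e**2 * dt)                # Integral of Squared Error
    itae = np.sum(t * np.abs(e) * dt)       # time-weighted IAE
    itse = np.sum(t * e**2 * dt)            # time-weighted ISE
    tvu  = np.sum(np.abs(np.diff(u)))       # Total Variation of the control action
    return dict(IAE=iae, ISE=ise, ITAE=itae, ITSE=itse, TVu=tvu)
```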
Table 4 highlights differences in orientation and linear speed errors across the tested agents. For instance, PPO-LSTM exhibits the highest orientation and linear speed errors, with IAE values of 2.175 and 3.393, respectively, which are considerably higher than those of the other agents. In contrast, TD3-LSTM achieves the lowest orientation error, with an IAE of 0.036, while DDPG-LSTM demonstrates the best linear speed tracking performance, obtaining the lowest IAE of 0.065. Examining such error metrics, TD3-LSTM minimizes orientation error in terms of IAE and ITAE, whereas TD3 achieves lower ISE and ITSE values with respect to all test approaches. This behavior is also shown in
Figure 8d, where TD3 maintains near-zero orientation error except for two brief deviations, whereas TD3-LSTM shows a more stable yet slightly less accurate response. Furthermore, the inclusion of LSTM layers in DDPG-LSTM and SAC-LSTM appears to contribute to lower ITAE and ITSE values for position and orientation error compared with their non-LSTM counterparts. A possible explanation for this is that the LSTM’s memory mechanism accumulates information on orientation deviations over time and helps mitigate them, leading to corrective control actions that reduce long-term tracking errors. However, the reduced performance of PPO-LSTM may stem from the interaction between its policy update strategy and the LSTM memory dynamics, potentially leading to error accumulation rather than attenuation. The control effort exerted by each controller is also quantified in
Table 4. Deterministic policy-based controllers exhibited lower control effort, with DDPG and TD3 attaining TVu values of 3.38 and 3.62, respectively. In contrast, stochastic policy-based controllers such as SAC and PPO showed significantly higher control variability, with TVu values of 27.64 and 66.82, respectively, indicating more abrupt control actions. The incorporation of LSTM into the deterministic controllers yielded further reductions in control effort. For instance, DDPG-LSTM and TD3-LSTM reduced TVu by approximately 16% and 20%, respectively, relative to their non-recurrent counterparts. This further supports that LSTM’s memory mechanism contributed to stable corrective actions. However, recurrent stochastic policies exerted a significantly higher control effort, as both PPO-LSTM and SAC-LSTM achieved TVu values of 574.02 and 494.89, respectively, indicating that the inclusion of LSTM may result in increased stochasticity. In general, these results suggest that LSTM layers store and integrate orientation error over time, enabling control actions that achieve lower tracking errors compared with those used by base controllers, as illustrated in the control actions depicted in
Figure 8b.
7.5. Tracking Test in Square-Type Trajectory
This test assessed the control performance of DDPG, PPO, TD3, SAC, and their LSTM-based variants while tracking a square-type reference trajectory. The simulation results are depicted in
Figure 9, whereas performance metrics (IAE, ISE, ITAE, ITSE, and TVu [
58]) are summarized in
Table 5.
Figure 9a depicts a significant error margin while tracking the reference trajectory with DDPG, PPO, TD3, and SAC, whereas the trajectories tracked by the LSTM-based strategies closely approach the reference with minimal positional drift. Unlike the baseline DRL approaches, the LSTM-based strategies exhibit minimal lateral, orientation, and speed tracking errors along the trials, as shown in
Figure 9c–f. This reinforces that the proposed LSTM-based agents prioritize maintaining orientation and linear speed accuracy, beyond minimizing robot position deviations, to effectively track the reference trajectory. Moreover, due to the controller’s limited turning capability, turns are not completed instantaneously, causing the agent to deviate from the trajectory once the turn is executed. The agents improved positional tracking based on the trade-off between positional errors and both linear speed and orientation accuracy. Indeed, unlike the baseline DRL approaches, the system response of the LSTM-based methods can be attributed to the added LSTM layers, which captured and retained changes in each observed orientation and position tracking error, influencing the learning process through improved rewards in each episode. Such rewards prioritized reducing trajectory tracking errors over orientation errors. For DDPG and DDPG-LSTM, the reduced positional tracking performance and the lack of LSTM improvement on this metric may stem from these agents overestimating the value of states with lower linear speed and orientation errors, thereby prioritizing these variables over positional accuracy.
The PPO-LSTM agent demonstrated improved control performance of positional tracking against the other test controllers, achieving the lowest IAE, ISE, ITAE, and ITSE metrics, as shown in
Table 5. Specifically, PPO-LSTM reduced the IAE by 74% compared with PPO, while TD3-LSTM and SAC-LSTM achieved reductions of 21% and 37%, respectively, relative to their non-LSTM counterparts. While DDPG and DDPG-LSTM exhibited similar performance in positional tracking, their tracking errors were higher than those of PPO-LSTM. For the orientation error, PPO-LSTM achieved the lowest IAE (69.34) and ITAE (24,237.61), exhibiting again its effectiveness in maintaining accurate orientation control. Despite each approach presenting a similar orientation error in terms of the ISE, DDPG performed the best among all proposed approaches, obtaining an ISE of around 74.89 and an ITSE of about 1272.98. This suggests that while PPO-LSTM minimized accumulated orientation errors over time, DDPG was more effective in reducing large deviations. In terms of linear speed error, DDPG consistently outperformed all other agents across all metrics, achieving the lowest IAE (0.48), ISE (3.70), ITAE (8.40), and ITSE (40.50). Moreover, the LSTM-enhanced versions of all models introduced higher linear speed errors in the trials, indicating that the added memory component may generate slower adaptation to speed changes. Lastly, regarding the control effort and its variation, DDPG achieved the lowest overall control effort, with a TVu of 5.38, followed closely by TD3 (5.92), suggesting that both deterministic controllers applied smoother control actions. The integration of LSTM slightly increased control effort for DDPG-LSTM (TVu = 6.22) but yielded a marginal reduction for TD3-LSTM (TVu = 5.72) compared with its base model. Conversely, PPO-LSTM and SAC-LSTM exhibited the highest TVu values, reaching 187.30 and 161.03, respectively, indicating high variations in control actions during control execution. These results show that while LSTM integration benefits stochastic policies in terms of tracking accuracy, it also significantly amplifies control action variability, potentially compromising smooth motion. Overall, PPO-LSTM emerged as the best control approach for position and orientation tracking, while DDPG excelled at minimizing linear speed errors and ensuring the smoothest control actions, especially at sharp corners in the square reference trajectory. The results highlight the effectiveness of integrating LSTM for stochastic policy agents, but its impact on deterministic agents led to an increase in errors in linear speed tracking and control action smoothness, suggesting that the added memory component may have hindered rapid adaptation to dynamic changes in the robot.
7.6. Tracking Test in Mining Environment
This test involves tracking a trajectory composed of pre-planned waypoints, designed to align with maneuverability constraints in a mining environment. The reference trajectory was obtained from the underground mine map presented in
Section 7.1, with trials starting at waypoint A, passing by B, and concluding at waypoint C, as illustrated in
Figure 10. The experimental results of tracking this trajectory within the underground mine are shown in
Figure 10, while quantitative control performance metrics are detailed in
Table 6. Unlike in previous tests, it can be noticed for all tested control approaches that the orientation error in
Figure 10d initially increased at the very beginning of the trials at waypoint A due to changes in the initial pose conditions. However, as the agent began executing turning actions, the orientation error subsequently decreased. This degradation at the start of the trials may be attributed to over-fitting effects, as the trained agents tended to initiate their trajectories with a specific turning action learned during the training phase. This behavior suggests that the models developed a dependency on specific initial conditions, impacting their ability to generalize to initial conditions of new scenarios. Furthermore, during waypoint transitions, particularly at point B, abrupt turning maneuvers exposed significant differences in controller robustness. PPO- and SAC-based agents exhibited high instantaneous control inputs, leading to unstable angular responses and trajectory overshoots. These cases demonstrate limited generalization to tight angular transitions, a critical failure mode for underground navigation. Similarly, the turning control actions of DDPG, PPO, PPO-LSTM, SAC, and SAC-LSTM experienced abrupt compensation while turning at point B in
Figure 10. Beyond the demanding nature of these turning actions, the new kinematic conditions required for the turn were not accurately met. Nonetheless, this issue was mitigated by improving the initial sampling strategy during the agent’s training process.
Table 6 provides a comprehensive comparison of tracking performance across the test DRL controllers. In terms of linear speed error, the lowest IAE was obtained by using the DDPG agent, achieving a value of 1.15, followed closely by SAC with 1.17, while PPO demonstrated slightly higher tracking errors, with an IAE of 1.38. Such results are consistent with the linear speed error trend, where SAC achieved the lowest ISE value (i.e., 52.24), outperforming the other test controllers. This suggests that SAC maintains stable speed control, reducing abrupt changes in the trajectory. Regarding position error metrics, TD3-LSTM achieved the lowest values across multiple metrics, including the IAE (i.e., 80,751.31) and ISE (i.e., 15,543.57), outperforming other LSTM-based models such as SAC-LSTM and DDPG-LSTM (see
Table 6). This performance can be explained by TD3’s ability to mitigate overestimation bias, leading to more stable policy updates, while the LSTM layers enhance temporal information processing, allowing for better adaptation to the dynamic changes in the reference trajectory. Regarding the orientation error, TD3-LSTM also showed improved performance, attaining the lowest IAE (124.92) and ITAE, validating the advantage of memory-based architectures in handling trajectory deviations. In contrast, PPO exhibited the highest orientation error values, with an IAE of 370.88, indicating less stable heading control when navigating the underground mine environment.
In terms of deployment limitations, PPO-LSTM and SAC-LSTM exhibited inconsistent alignment in narrow corridors such as point B, where they occasionally triggered unexpected stops due to unsafe or overly conservative control commands. These issues were not anticipated during the simulation evaluations, underscoring the need to incorporate exteroceptive information into both the learning process and the reward scheme. These results also highlight the need to expose the agents to a broader range of environmental conditions during training, including the adoption of domain randomization strategies to improve policy generalization. Regarding control action smoothness, TD3 demonstrated the lowest variation in linear speed (i.e., 0.30), while TD3-LSTM minimized the variation in angular speed (i.e., 2.26). These results suggest that TD3 and its LSTM-enhanced counterpart generate smoother control actions compared with the other test controllers, reducing oscillations and improving trajectory adherence. TD3-LSTM showed the lowest TVu of 2.57, validating its ability to generate continuous control actions and mitigate oscillatory behavior in the physical actuators of the SSMR. Moreover, the results indicate that TD3-LSTM offers the best overall performance in terms of position and orientation tracking due to its ability to learn more stable policies while incorporating past observations. Conversely, while SAC and DDPG exhibit suitable performance in linear speed control, they do not generalize as effectively to position and orientation tracking, particularly in complex environments such as those found in underground mines.
7.7. Robustness Test Under Terrain Disturbances
The proposed controllers were tested for trajectory tracking on lemniscate and square reference trajectories under terrain disturbances to evaluate their robust control performance. The disturbances, unknown for the agents, involved transitioning from flat terrain (where no drifting occurs) to low-friction and uneven terrain, which leads to slip phenomena commonly encountered in mining environments. First, the disturbance conditions occurred across terrain transitions while tracking a lemniscate reference trajectory, specifically at time instants 100.6 [s] and 680.4 [s] of the trials.
Figure 11 presents tracking results for the lemniscate trajectory, highlighting capabilities of tracking error regulation in the tested DRL controllers when rejecting external disturbances (see
Figure 11c,e). After the first disturbance instant, each controller returned the system to the reference trajectory; thus, the linear speed, orientation, and lateral errors reached near-zero values (see
Figure 11d–f). Among the tested controllers, the proposed DDPG and TD3 approaches, along with their LSTM-enhanced variants, exhibited improved disturbance rejection compared with PPO and SAC, maintaining low tracking errors throughout terrain transition scenarios. In contrast, PPO-LSTM and SAC-LSTM showed increased tracking errors after the disturbance instants and exhibited slower convergence in reducing steady-state errors, leading to residual offsets throughout the trial. This suggests that while PPO combined with LSTM demonstrated adaptability by gradually reducing error offsets over time, its ability to reject disturbances was lower compared with that of the DRL baseline methods, which prioritize policy refinement through continuous value estimation. Moreover, despite all disturbance occurrences, all tested controllers successfully maintained low and close-to-zero orientation and speed errors (see
Figure 11d,f), enabling the SSMR to align with the reference trajectory and recover the reference speeds. Unlike DDPG-LSTM, TD3-LSTM, and PPO-LSTM, baseline DRL controllers such as PPO and SAC exhibited persistent lateral position errors after each disturbance event, primarily due to unobserved dynamics arising from unmodeled terra-mechanical variations, as shown in
Figure 11e. Such disturbances, omitted during training, may have induced policy prioritization biases, leading the learned DRL controllers to favor orientation and linear speed corrections over accurate lateral position tracking. This highlights the need for a more comprehensive domain randomization strategy that accounts for the inherently random and heterogeneous terrain characteristics, ensuring further exploration across diverse terra-mechanical conditions.
The results of tracking for the square-type trajectory under terrain disturbances are presented in
Figure 12, with disturbances occurring at approximately 12.6 [s] and 12.8 [s]. The agents show robust control performance, maintaining the trajectory orientation, as shown in
Figure 12d, and reference speed, as depicted in
Figure 12f. After the first disturbance, all DRL agents were capable of correcting for the external disturbances, thus showing orientation and linear speed errors close to zero. However, the positional error was not fully corrected by the PPO agent, as the SSMR attempted to return to the reference trajectory but maintained large steady-state tracking errors, as shown in
Figure 12c,e. A similar transient response was observed after the second disturbance, where orientation and linear speed errors were again reduced to near-zero values. Longitudinal positional errors were effectively mitigated across all proposed controllers except for the one based on PPO, which initially counteracted disturbances but exhibited a continuous decline in robust control performance. Overall, these results indicate that while LSTM integration enhances trajectory tracking for most DRL agents, PPO demonstrated deficient robustness in the presence of disturbances.
It is worth noting that the training of the DRL controllers was conducted under transitory slip conditions, without incorporating prior physical information about the terrain type, such as grass, pavement, or gravel. Consequently, the learning process did not exploit any terrain classification cues but relied exclusively on observable states, including position, orientation, speed, and their respective tracking errors. Although the experimental validation was limited to a single robot configuration and a particular terrain disturbance scenario, the proposed methodology is not inherently constrained to this setup. The underlying control policy is observation-driven and model-free, allowing for potential transferability to other robotic platforms or unstructured environments exhibiting similar observation–action relationships. Nonetheless, such generalization should be approached under test conditions. Effective transfer would require that key aspects of the wheel–terrain interaction—such as friction and slip dynamics—be implicitly captured within the observation space and appropriately represented during training.
7.8. Test on Robustness Against Model Parameters
To assess the robustness of the DRL agents under varying SSMR conditions, the proposed controllers were tested by using a different set of dynamic model parameters, thereby varying the robot dynamics. In this test, two model parameters were gradually increased in 5% increments, reaching a total variation of 25% from their original values, as defined in the SSMR model in Equation (10) and detailed in [3]. These parameters were selected because they have a direct influence on the traction and turning speeds. The results of this test for the maximum parameter variation of 25% on the lemniscate trajectory are presented in
Figure 13. According to
Figure 13a, the proposed DRL controllers successfully led the SSMR along the reference trajectory, maintaining reasonable tracking accuracy. However, lateral and longitudinal tracking errors progressively increased over the trial, indicating that the controllers struggled to generalize to entirely unobserved variations in system dynamics, as such variations were not explicitly included in the training process. While the control performance remained adaptable to moderate parameter variations, the tracking error increase became significant as the parameter variations approached their maximum tested value. In particular, the SSMR deviation from the reference position suggests that the proposed controllers could present potential over-fitting, even when the training process included varying robot dynamics. Comparing all tested DRL controllers, the LSTM-based DRL agents exhibited improved adaptability to dynamic variations by generating control actions that suitably accommodated model parameter changes over time. The integration of historical observations in LSTM-based controllers prevented abrupt policy changes, resulting in smoother and more stable trajectory tracking performance compared with the baseline DRL controllers. On the other hand, in
Figure 13f, the DDPG and TD3 algorithms exhibited the lowest linear speed errors, followed by PPO and SAC, whereas DDPG-LSTM, TD3-LSTM, PPO-LSTM, and SAC-LSTM presented higher orientation and linear speed errors with respect to the other controllers. By analyzing
Figure 13d, it can also be observed that deterministic policy agents (i.e., DDPG and TD3) achieved lower orientation errors than their stochastic counterparts (i.e., PPO and SAC). This outcome aligns with the expectation that deterministic controllers can achieve more precise control in environments with parameter stability, whereas stochastic methods may introduce exploration noise that degrades orientation accuracy under dynamic variations. In the case of the on-policy recurrent agent (PPO-LSTM), an improvement in adaptability was observed compared with its base model, PPO. This behavior suggests that the use of the state value function as a critic prevents the over-fitting observed in off-policy recurrent cases. Since the critic network does not receive the agent’s action as an input, it is less likely to overlearn a linear relationship between linear speed error and traction speed control action, preserving adaptability to changing model parameters.
The control performance of the proposed DRL controllers under model parameter variations on a squared-type trajectory is presented in
Figure 14. The turning response of the DRL agents remained qualitatively similar to that observed in trials without parameter variations. However, during cornering maneuvers, the agents exhibited overshoot in the turning response, leading to increased transient responses. The most significant deviations were observed in the SAC-LSTM and PPO-LSTM agents, as well as in PPO, TD3-LSTM, and DDPG-LSTM throughout the final turn of the trajectory.
Figure 14f presents the linear speed error, where the TD3 agent achieved the lowest speed error values, followed by SAC and DDPG, PPO-LSTM, PPO, DDPG-LSTM, TD3-LSTM, and SAC-LSTM. Furthermore, the PPO-LSTM agent improved its positional tracking error compared with the baseline PPO agent, as presented in
Figure 14c,e. These results are consistent with the previous robustness test, reinforcing the observation that LSTM-based on-policy agents exhibit improved adaptability to model parameter variations by leveraging historical state information, further confirming the role of the policy architecture in adapting to system dynamics. Deterministic policy agents (DDPG and TD3) maintained better orientation stability than their stochastic counterparts (PPO and SAC), particularly in turns where rapid control adjustments were required. However, recurrent architectures (LSTM-based agents) demonstrated rougher control policies, as they presented increased control effort during turning maneuvers. While recurrent stochastic agents are expected to present a noisier response, as discussed in
Section 7.3, the performance decrease observed in the deterministic agents may be attributed to over-fitting to the training conditions, as observed in the lemniscate-type trajectory test under model parameter variations. This suggests that the performance of LSTM-based agents on high-precision tasks, such as corner-turning maneuvers, may require further optimization.
8. Conclusions
In this work, Deep Reinforcement Learning (DRL) techniques integrated with Long Short-Term Memory strategies were presented and validated for robust trajectory tracking control of skid-steer mobile robots (SSMRs) operating in mining environments with uncertain terrain conditions. Several DRL algorithms were assessed and compared with each other, including Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor–Critic (SAC), alongside their recurrent LSTM-enhanced counterparts, to address partial observability in real-world navigation scenarios. The proposed controllers were trained and tested on various reference trajectories, first in simulation scenarios and subsequently in a field experimental setup in a mining environment. In general, performance metrics demonstrated that LSTM-based controllers significantly improved trajectory tracking accuracy, reducing the ISE by 10%, 74%, 21%, and 37% for DDPG-LSTM, PPO-LSTM, TD3-LSTM, and SAC-LSTM, respectively, compared with their non-recurrent counterparts. Furthermore, DDPG-LSTM and TD3-LSTM achieved the lowest orientation errors (in terms of the IAE, ISE, and ITAE) and linear speed errors (in terms of the IAE and ITAE), outperforming baseline DRL controllers. These results highlight the effectiveness of integrating LSTM into DRL-based controllers, enabling them to leverage temporal dependencies and infer unobservable states, thereby enhancing robustness against terrain disturbances and model parameter variations. In addition, recurrent architectures improved upon their non-recurrent counterparts in generalizing to unobserved trajectories, suggesting that incorporating memory mechanisms enhances adaptability to dynamic and unstructured environments. However, further evaluation revealed limitations in the LSTM-based controllers, particularly within on-policy algorithms (PPO-LSTM), where a reduction in adaptability to varying model parameters and terrain disturbances was observed. This finding indicated the potential over-fitting of recurrent networks when subjected to significant model variations, emphasizing the need for improved regularization techniques or hybrid learning strategies to further enhance generalization capabilities. Future work will explore the integration of a more detailed terra-mechanical model with varying friction and slip conditions to enable a more realistic assessment of DRL controllers in adverse terrain scenarios. A formal robust control stability analysis of the proposed trajectory tracking controllers is also required in order to evaluate their practical feasibility for real-world deployment in autonomous mining operations. Furthermore, ongoing work aims to evaluate performance under heterogeneous terrain conditions and varying robot morphologies to assess the robustness and adaptability of the proposed control methodology across diverse operational scenarios.