1. Introduction
Autonomous navigation and motion control are core technologies for mobile robot autonomy and key indicators of it. Mobile robots with autonomous navigation and motion capabilities are widely used in agriculture, machining, rescue, scientific research, services, and almost all other mobile application fields, so demand is broad and application prospects are extensive. In the global revolution of Industry 4.0, mobile robots increasingly assume a critical role in human life. They will also occupy an essential part of industrial production and manufacturing systems. Mobile robots must be able to navigate autonomously, make decisions, perceive their surroundings well, and accurately complete the required tasks. Furthermore, the working environment of these mobile robots entails various uncertain factors that pose significant challenges to their autonomous navigation and decision-making capabilities.
Casualty rescue missions and critical facility inspections are typical applications of mobile robots. In such scenarios, to ensure the personal safety of the wounded and the safety of essential facilities, the robot must move quickly, stably, and accurately along the planned trajectory [1]. To fulfill all mission requirements with minimal tracking error across different expected trajectories, mobile robot trajectory tracking control algorithms with superior adaptability, real-time capability, and robustness are often necessary. In this domain, the challenges associated with most trajectory tracking control algorithms remain a subject of intense research [2,3]. Therefore, such tasks call for trajectory tracking control algorithms that adapt quickly and run in real time to cope with dynamic, unpredictably changing environments.
The backstepping controller [4] demonstrates a global stabilization effect in nonlinear and non-holonomic constrained systems, making it a popular choice for a variety of mobile robot platforms [5,6,7,8]. It is widely used for trajectory tracking control of mobile robots. Adopting a backstepping controller in a mobile robot system often involves expert tuning to find the optimal control gains, although modern control theory provides methods and suggestions for tuning them. Wang [9] proposed a backstepping controller gain adjustment method that employs compensation terms in the control law to resolve the over-parameterization problem in parameter estimation. Similarly, Wang [10] designed a linearization-based gain adjustment method for nonlinear backstepping controllers: the linear part is separated from the nonlinear system to form a linear auxiliary system, a state feedback gain is determined for that auxiliary system, and this state feedback gain is then converted into the backstepping gain of the controller for the nonlinear system. However, the complex dynamics, strong coupling, and nonlinearity of mobile robots make gain tuning for the backstepping controller difficult, which often results in poor real-time performance, low control accuracy, and poor robustness. Online adjustment of the controller gains to match the task requirements and changes in the mobile robot’s operating environment therefore usually requires expert-level a priori knowledge.
The traditional backstepping controller cannot adjust its gains in real time during trajectory tracking control of the mobile robot, making it challenging to achieve precise and optimal control. To address this issue, some researchers have proposed self-adaptive trajectory tracking controllers. Hu [11] proposed a robot tracking controller based on self-adaptive backstepping to handle inaccurate robot kinematic and dynamic modeling and external environmental disturbances. By designing an appropriate Lyapunov function, the stability of the robot tracking controller is guaranteed. Van [12] designed a self-adaptive backstepping robot tracking control algorithm that employs nonsingular fast terminal sliding mode control (NFTSMC) to handle uncertainties, external disturbances, and fault compensation in the robot control system. The controller retains the merits of NFTSMC: high robustness, fast transient response, and finite-time convergence. However, its primary limitation is its reliance on prior knowledge of disturbance and uncertainty bounds during the design process. Sun [13] designed a self-adaptive tracking controller that compensates for errors and automatically tunes parameters based on the nonlinear system of mobile robots to solve the trajectory tracking problem under parameter uncertainty. Based on the dynamic model of the mobile robot, a state-feedback self-adaptive controller is designed by selecting an appropriate Lyapunov function through a recursive design method so that the mobile robot gradually converges to the desired trajectory. However, traditional self-adaptive backstepping control algorithms usually cannot achieve optimal control under varying environments and task requirements, such as low-level drive control, path tracking, and trajectory tracking, and they require expert knowledge and multiple adjustments to obtain optimal controller parameters.
Reinforcement learning (RL) methods [13], a significant branch of artificial intelligence, have developed rapidly in the field of control. The RL algorithm is an iterative learning process that involves continuous interaction between the robot system and its environment. It is a hybrid control strategy that employs a reward value function to evaluate action quality during this interaction. Compared with traditional control algorithms, RL has significant potential advantages: it keeps improving the control strategy through the reward value function until it reaches the optimal solution. Owing to its robustness, real-time performance, and generalization, RL can better solve control problems that are difficult for traditional algorithms [14]. Advanced RL methods can address the curse of dimensionality and the difficulty of converging to an optimal control policy, and they achieve strong performance in complex environments. Q-learning and its derivatives [15] are among the most widely used and successful RL algorithms and have found applications in domains such as mobile robot path planning [16,17], low-level control [18], game development [19,20], natural language processing [21,22], and autonomous driving [23,24].
Many researchers have contributed to RL-based adaptive control of mobile robots, typically by using RL to optimize the parameters of traditional control algorithms. For instance, Ignacio [25] proposed an incremental Q-learning strategy for adaptive PID control that tunes parameters when the system and operating conditions are unknown and variable. The feasibility of the algorithm was verified by simulation and physical experiments. Ignacio [26] then added a multi-function value update experience replay mechanism to adjust the controller parameters according to an incremental, model-free double Q-learning algorithm. Simulation and physical comparison experiments demonstrated that the proposed algorithm for adjusting the adaptive controller parameters has good real-time performance and robustness. However, these experiments only involved ground robots, so further research is necessary to determine the algorithm’s feasibility in other scenarios. Cheng [27] exploited an adaptive controller with fast estimation and active compensation capabilities for continuous state and action spaces, improving the success rate of RL control algorithms under different training environments or external disturbances. The controller is strongly robust and effectively reduces training time. Still, it has difficulty obtaining an effective solution for continuous or high-dimensional action space problems, and it easily becomes trapped in local optima. Subudhi [28] proposed an RL-based Actor/Critic-LQR adaptive control method for multi-link flexible manipulators under different load conditions. The algorithm does not depend on the dynamic model of the system and has lower computational complexity than other adaptive control algorithms. However, disturbances between the robot and the external environment are not considered. Khan S G [29] developed an online adaptive control strategy that combines dynamic programming and reinforcement learning for two joints of a humanoid robot arm (shoulder flexion and elbow flexion). Performance improved significantly in both simulation and physical experiments. However, compared with traditional methods, the computational load is heavy, the control response time is longer, and the hardware requirements are very high.
In the field of RL, the double Q-learning algorithm [30] is widely used in industry because of its good learning characteristics: it effectively reduces the overestimation that is typical of the Q-learning algorithm. Ou [31] proposed a new RL-based framework for autonomous obstacle avoidance of quadrotors. A double-depth loop is used to solve the error problem of the observation capability of the airborne monocular camera. Behzad [32] proposed a double Q-learning algorithm for aircraft trajectory optimization. While meeting the communication connectivity constraints required for safe operation, the aircraft completes its operation in the shortest time. The feasibility of the algorithm was verified by experiments under two assumptions: a short-term absence of the GPS signal and a loss of the GPS signal for a certain period. Faezeh [33] combined double Q-learning and A* in an off-policy RL algorithm to build a new controller. Compared with the traditional A* and Q-learning algorithms, the algorithm is more reliable and better at avoiding collisions with obstacles. Khan [34] proposed an algorithm based on deep double Q-learning to solve the motion planning problem of robots in complex environments. Using multiple gaits, the robot is trained to minimize the distance between its current position and the training target point. Extensive tests across various terrains demonstrated the algorithm’s efficacy in all unknown complex environments with 100% performance efficiency.
Robots controlled by RL and backstepping methods must have high control accuracy for typical complex application situations such as casualty rescue and power facility inspection. In summary, the contributions of this paper are highlighted as follows:
To address the trajectory tracking control problem of mobile robots, artificial intelligence is introduced into the backstepping trajectory tracking controller, and a control scheme combining backstepping with reinforcement learning is proposed.
Compared with the traditional algorithm, the proposed method does not need to be trained offline with a large number of samples. It achieves high-precision tracking control through fast online learning that optimizes the controller gains, without requiring expert-level gain adjustment.
The new algorithm combines the double Q-learning incremental discretization strategy with a subregional active learning strategy. In order to improve the learning efficiency, an experience replay mechanism and a time–memory function are introduced for online gain adjustment, so that online learning can be completed faster, and optimal control can be realized.
The rest of the paper is organized as follows. Section 2 presents the problem statement. Section 3 details the online gain tuning of the high-accuracy trajectory tracking controller. Section 4 presents the simulation and physical experimental results of the control algorithm. Finally, conclusions are drawn in Section 5.
3. Self-Adaptive Trajectory Tracking Control Algorithm Based on Reinforcement Learning
When the mobile robot performs high-precision tracking along the desired trajectory, each time step $t$ of the tracking control is related to the error between the desired trajectory and the mobile robot, which satisfies the Markov property and changes in real time. In this paper, we therefore treat the task of tracking the desired trajectory as a Markov decision process: at time $t$, the agent in state $s_t$ selects and executes an action $a_t$ according to the optimal policy $\pi^*$, and this action is used as the parameter gain of the current trajectory tracking controller to adaptively adjust the linear and angular velocity of the mobile robot. Then, the state $s_{t+1}$ of the agent at time $t+1$, the reward $r_t$ at the current time $t$, the action $a_t$, and the state $s_t$ are combined into a new tuple $(s_t, a_t, r_t, s_{t+1})$. Using the idea of backstepping trajectory tracking control, combined with the Double Q-learning algorithm, state space sub-regions, and an active learning mechanism with incremental discretization of the action space, the parameter gains of the backstepping controller are self-adjusted through the online learning strategy so that the mobile robot can complete trajectory tracking tasks of multiple types with high accuracy. The robot trajectory tracking control structure is shown in Figure 4. In this control structure, the RL agent plays the role of the upper controller. The system interacts with the environment continuously, and the agent selects and executes the optimal action $a_t$ according to the optimal policy $\pi^*$, self-adaptively adjusting the parameter gains of the lower backstepping trajectory tracking controller to achieve high-precision trajectory tracking.
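As a rough illustration of how an action chosen by the agent can serve as the gain of the lower controller, the sketch below applies a commonly used backstepping tracking law for a differential-drive robot. This particular control law, the gain names `k1`, `k2`, `k3`, and the function names are assumptions made for illustration; the controller derived in Section 2 may differ in form.

```python
import math

def pose_error(pose, ref_pose):
    """Return the reference-minus-robot pose error expressed in the robot frame."""
    x, y, theta = pose
    x_r, y_r, theta_r = ref_pose
    dx, dy = x_r - x, y_r - y
    x_e = math.cos(theta) * dx + math.sin(theta) * dy
    y_e = -math.sin(theta) * dx + math.cos(theta) * dy
    # Wrap the heading error into [-pi, pi].
    theta_e = math.atan2(math.sin(theta_r - theta), math.cos(theta_r - theta))
    return x_e, y_e, theta_e

def backstepping_control(error, v_ref, w_ref, gains):
    """Assumed backstepping law; `gains` is the action supplied by the agent."""
    x_e, y_e, theta_e = error
    k1, k2, k3 = gains
    v = v_ref * math.cos(theta_e) + k1 * x_e
    w = w_ref + v_ref * (k2 * y_e + k3 * math.sin(theta_e))
    return v, w

# Hypothetical usage: the robot is 1 m behind the reference on a straight line.
error = pose_error((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
v_cmd, w_cmd = backstepping_control(error, v_ref=0.2, w_ref=0.0, gains=(0.5, 3.51, 2.5))
```

In this arrangement the control law stays fixed while the RL agent supplies `gains` at every time step, for example starting from the initial gains [0.5, 3.51, 2.5] used in Section 4.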
3.1. Experience Replay Mechanism
Mnih et al. [44] proposed the DQN algorithm at NIPS, a classic work in the field of RL. On the one hand, a deep neural network is used to approximate the action–value function. On the other hand, the experience replay mechanism [45] is used to significantly improve the agent’s sample utilization and learning efficiency. The experience replay mechanism is a simple and effective method: it stores the data obtained from the interaction between the agent and the environment as memory cells in a replay buffer and then performs updates using samples selected at random from that buffer, as shown in Figure 5.
This kind of experience replay mechanism is common in RL algorithms. It is a model-based RL paradigm structure that repeatedly updates the system model and action-value function in a one-step format. This paper, however, is based on the backstepping trajectory tracking error model: we can perform ordered iterations of the algorithm according to the trajectory tracking error model, and we can randomly select minibatches from the replay buffer to update the state-action value function of the system. Moreover, the replay buffer is bounded, so it can store only a fixed maximum number of memory cells. Once the buffer is full, each newly stored memory cell overwrites the oldest one, starting from the first cell stored and cycling in turn.
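The bounded, cyclically overwritten buffer described above can be sketched as follows. The class name `ReplayBuffer`, its method names, and the tuple layout of a memory cell are hypothetical choices for illustration, not the authors’ implementation.

```python
import random

class ReplayBuffer:
    """Bounded buffer that overwrites the oldest memory cell once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cells = []
        self.write_index = 0  # position of the next cell to write or overwrite

    def store(self, cell):
        if len(self.cells) < self.capacity:
            self.cells.append(cell)
        else:
            self.cells[self.write_index] = cell
        self.write_index = (self.write_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Draw a random minibatch without replacement (smaller if the buffer is short)."""
        k = min(batch_size, len(self.cells))
        return rng.sample(self.cells, k)

    def __len__(self):
        return len(self.cells)

# Hypothetical usage with (state, action, reward, next_state) memory cells.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store((step, 0, -1.0, step + 1))
minibatch = buffer.sample(2)
```

A plain list with a write index is used here so that random sampling stays cheap; a `collections.deque` with `maxlen` would give the same overwrite behavior.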
In this paper, the trajectory tracking controller uses the Double Q-learning algorithm to learn the controller parameter gain adjustment online. Before the experience replay mechanism is used, the two state–action value functions of Double Q-learning need to be randomly initialized. In practical applications of RL algorithms, initializing a state-action value function is not a trivial task. Random initialization has two advantages for the agent. First, random values in the value function facilitate exploration of unknown areas, which is crucial in RL, and reduce the possibility of excessive deviation of the state–action value function. Second, it reduces the number of hyperparameters that must be supplied, so algorithm designers do not need expert experience for hyperparameter tuning. This makes the control algorithm easier to reproduce and to test across platforms.
3.2. Incremental Discretization Process
In traditional Q-learning or Double Q-learning algorithms, the state space
is learned interactively with the environment through uniform discretization. However, our proposed incremental discretization learning algorithm is a process of continuously updating and adding state space
through interactive learning. With each interaction, the system state is updated according to the two forms of state described in
Section 2.2.3. If the system state does not change, an optimal action is selected and executed according to Equation (8), and the same subsequent operation is performed
times. As described in the previous section, in this case, the state of the agent remains unchanged, and the action space
needs to be incrementally discretized and refined.
We define a tuple, which stores all the information about the state–action of the agent at the current moment, that is,
. Then, according to the theoretical basis of N-memories, we compare the tuples
at each moment, that is
, to determine whether the state–action information of the agent changes during a certain time interval
. If the assumption is valid, the system is in the time interval
when there is no change. The agent needs to perform incremental discretization processing on the optimal action selected in this time interval to generate a more refined action subset, as shown in
Figure 3. Since the state–action space is in one-to-one correspondence, the discretization of the action space will inevitably lead to the discretization of the state space; if the state of the agent is still the same as the current system state at the next moment, the agent will select and execute the optimal action from the newly created subset of actions.
The incrementally discretized state-action active learning mechanism confines exploration to a given region of the space and thus keeps the robot from exploring unnecessary areas, making the state-action space in the Q table more practical. The control resolution is refined from coarse to fine, which significantly reduces the number of calculations and improves the real-time performance of the control system. As shown in Figure 6, the initialized state space is divided into three regions. The multiple shaded reachable states represent sub-regions of the state space after incremental discretization.
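The refinement step can be illustrated with a small sketch for a single scalar gain. The helper names, the evenly spaced three-point subset, and the choice of halving the interval at each level are assumptions for illustration only; the paper’s refinement rule may differ.

```python
from collections import deque

def is_invariant(history, n):
    """True when the last n stored (state, action, level) tuples are identical."""
    if len(history) < n:
        return False
    recent = list(history)[-n:]
    return all(item == recent[0] for item in recent)

def refine_actions(center, half_width, points=3):
    """Return a finer, evenly spaced action subset centred on the selected action."""
    if points < 2:
        raise ValueError("need at least two points")
    step = 2.0 * half_width / (points - 1)
    return [center - half_width + i * step for i in range(points)]

# Hypothetical usage: the same tuple was observed three times in a row,
# so the selected gain 1.0 is refined into a narrower subset around it.
history = deque(maxlen=3)
for _ in range(3):
    history.append(((0, 0), 1.0, 0))
if is_invariant(history, 3):
    finer = refine_actions(center=1.0, half_width=0.5)  # [0.5, 1.0, 1.5]
```

Calling `refine_actions` again with a smaller `half_width` at the next discretization level narrows the candidate gains further, which is the coarse-to-fine behavior described above.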
3.3. Algorithmic Statement
The pseudo-code of the proposed incremental Double Q-learning self-adaptive trajectory tracking control algorithm is shown in Algorithm 1.
Algorithm 1 Double Q-Learning Trajectory Tracking Algorithm Pseudo-Code
1: Input:
2: Position error of the system at initialization
3: Initialize the experience replay buffer
4: Initialize the state space
5: Initialize ,
6: Loop:
7:   Select an action using the ε-greedy strategy
8:   The current tuple is stored in
9:   Model the robot kinematics and calculate the positional errors
10:   Update the next state , the robot input and the reward
11:   if the system is variant then
12:     The current memory cell is stored in the experience replay buffer
13:     Divide the region according to the state space and find the nearest point of to
14:     if is inside then
15:       if is in the (0–0.2] interval, there are four chances to update and
16:       if is in the (0.2–0.5] interval, there are two chances to update and
17:       if is in the (0.5–∞) interval, there is one chance to update and
18:     else incorporate in ; set up and for the newly merged state
19:     end if
20:   else the system performs state-action incremental discretization active learning
21:   end if
22:   The system starts the experience replay mechanism
23:   Update the system state with incremental discretization .
24: end Loop
In the first line, the hyperparameters of the agent need to be entered. Some of these parameters are common in Q-learning or Double Q-learning. Such as learning rate , reward function , discount factor and exploration strategy . There are also some special symbols mentioned in this article. For example, initialize the coarse action space . Determine the positional relationship between the current system state and the state space according to the value function . is the action space discretization level parameter. The size of the minibatch for initializing experience replay is . The parameter is to determine whether the system is changeable and store the memory unit in . Finally, represent the threshold sizes of the state space and action space of the agent under different discretization levels.
From line 2 to line 5, initialize the state space , the experience replay buffer , and . In this case, the initialization state of the agent is not set randomly; it needs to be measured by some sensors. In this paper, the system’s initial state is the pose measured by the integrated inertial navigation as the initial state .
From line 6 to the end of the loop on line 24, the Double Q-learning incremental discretization active learning algorithm runs as an infinite loop. First, we use the greedy strategy to select the optimal action in the current state according to and , then selectively update the value function of or according to Equations (9) and (10). In each cycle, either is updated with the probability of to get the optimal action, or is updated with the probability of to get the optimal action. In line 8, the current state, action, and action discretization level of the agent are combined into a tuple in for subsequently determining whether the system is changing. In lines 9 and 10, the trajectory tracking controller is updated to obtain the robot input , the state at the next moment, and the current reward according to the pose error and optimal action at the current moment.
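For readers unfamiliar with this kind of selective update, the sketch below shows the standard tabular Double Q-learning step as a stand-in for Equations (9) and (10). The dictionary-based tables, the equal 0.5 probability of choosing which table to update, and the greedy choice over the sum of both tables are assumptions of this sketch and are not taken from the paper.

```python
import random
from collections import defaultdict

def greedy_action(q_a, q_b, state, actions):
    """Pick the action that maximizes the sum of both tables in `state`."""
    return max(actions, key=lambda a: q_a[(state, a)] + q_b[(state, a)])

def double_q_update(q_a, q_b, s, a, r, s_next, actions, alpha, gamma, rng=random):
    """Update one of the two tables in place, chosen at random."""
    if rng.random() < 0.5:
        q_sel, q_eval = q_a, q_b
    else:
        q_sel, q_eval = q_b, q_a
    # The updated table selects the next action; the other table evaluates it.
    best = max(actions, key=lambda b: q_sel[(s_next, b)])
    target = r + gamma * q_eval[(s_next, best)]
    q_sel[(s, a)] += alpha * (target - q_sel[(s, a)])

# Hypothetical usage with two candidate gain indices as actions.
q_a, q_b = defaultdict(float), defaultdict(float)
double_q_update(q_a, q_b, s="s0", a=0, r=-1.0, s_next="s1",
                actions=[0, 1], alpha=0.5, gamma=0.9)
next_action = greedy_action(q_a, q_b, "s0", [0, 1])
```

Decoupling action selection from action evaluation in this way is what reduces the overestimation of plain Q-learning mentioned in the Introduction.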
After the agent interacts with the environment, it needs to use to determine whether the system has changed. Then, the statements in lines 11 to 24 are executed. A changed system means that the values stored in are not all the same. In that case, we merge these memory cells into the experience replay buffer. Then, we find the closest point to after dividing in the region of state space , and update the state–action functions and according to the value function and the maximum state discretization . Otherwise, when the system is in the “invariant” condition, the agent improves the robot’s trajectory tracking control accuracy through active learning according to the incremental discretization mechanism.
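The interval rule in lines 15 to 17 of Algorithm 1 can be written as a small helper. The name of the tested quantity (`value`), the handling of non-positive inputs, and the callback-based `apply_updates` wrapper are hypothetical; only the interval bounds and the update counts come from the pseudo-code.

```python
def update_chances(value):
    """Map the tested quantity to the number of value-function updates."""
    if value <= 0:
        # The pseudo-code only covers values above zero; this sketch rejects the rest.
        raise ValueError("value must be positive")
    if value <= 0.2:
        return 4  # (0, 0.2]
    if value <= 0.5:
        return 2  # (0.2, 0.5]
    return 1      # (0.5, infinity)

def apply_updates(value, update_fn):
    """Call `update_fn` once per update chance and return how often it ran."""
    chances = update_chances(value)
    for _ in range(chances):
        update_fn()
    return chances

# Hypothetical usage: count how many updates a value of 0.3 triggers.
calls = []
performed = apply_updates(0.3, lambda: calls.append(1))
```

Granting more updates to the smallest interval concentrates learning effort on states that fall closest to the nearest stored point.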
After determining whether the current system has changed, line 22 uses the experience replay mechanism to update the value functions and . This speeds up the agent’s learning and shortens the discretization time, which significantly reduces the risk that the system fails to converge. Then, line 23 updates the system’s state and incremental discretization level in preparation for the next cycle.
4. Experimental Results
This section verifies the robustness, real-time performance, and anti-disturbance capabilities of the incremental discretization self-adaptive trajectory tracking control algorithm. First, we set the hyperparameters of the proposed control algorithm and verify its performance in Gazebo. Finally, we perform experiments comparing the proposed algorithm with Fuzzy-Backstepping trajectory tracking and traditional Backstepping trajectory tracking, providing further evidence of the algorithm’s feasibility and effectiveness.
4.1. RL Hyperparameter Settings
Before Gazebo and physical experiments, we must address some common hyperparameter problems in experiments. First, the parameter gain of the self-adaptive trajectory tracking controller is the
generated by selecting and executing the optimal action of the agent at each moment. In simulation and physical experiments, the initialization of the controller parameter gains is satisfied by the uniform random distribution
. The controller parameter gain
is initialized uniformly at random, which reduces the number of hyperparameters that must be set. This simplifies hyperparameter selection for the designer and reduces reliance on manual tuning. With less reliance on hyperparameter design, the controller parameters become more refined and can achieve better control accuracy and performance. Moreover, the reward function must provide enough useful information for the entire system to interact with the environment, without disturbing the external environment. In this paper, the reward function is designed as shown in Equation (11).
where,
is a free parameter,
represents the pose error between the robot’s current position and its desired point. When the system’s current pose
is close to the desired pose
, the value of the reward function approaches −1, which indicates that the system is approaching the desired pose. Otherwise, the reward function gives a lower reward.
In RL, it is necessary to balance exploration and exploitation in order to learn optimal policies. This balance has a significant impact on learning performance, because it determines whether the agent uses its existing strategy or explores new ones. Too much exploration prevents the agent from obtaining the maximum reward in the short term, because the agent randomly chooses actions that lead to poor rewards. In the early stages of reinforcement learning, however, the Q table is randomly initialized and holds little information about the interaction between the agent and the environment. To ensure systematic learning, optimal strategy selection, and comprehensive exploration of the environment, a high-exploration strategy is therefore adopted at first. As the agent gains knowledge over time, the high-exploration strategy is changed to a lower-exploration strategy [46,47]. To achieve this transition, we use an exploration-exploitation coefficient that decays over time. It determines the probability with which the robot system selects the optimal action based on the current strategy or a random action via the exploration strategy. This probability is expressed by Equation (12):
For the simulation and the physical experiment, the values of and are the same.
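As an illustration of a decaying exploration-exploitation coefficient, and not necessarily the form of Equation (12), the sketch below uses an exponential decay toward a floor value. The function names and the default values `eps_start`, `eps_min`, and `decay_steps` are assumptions chosen only for this example.

```python
import math
import random

def epsilon_at(step, eps_start=1.0, eps_min=0.05, decay_steps=200.0):
    """Exploration coefficient that decays exponentially from eps_start to eps_min."""
    return eps_min + (eps_start - eps_min) * math.exp(-step / decay_steps)

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a dict that maps each action to its value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))   # explore: random action
    return max(q_values, key=q_values.get)  # exploit: best known action

# Hypothetical usage: exploration shrinks as the step counter grows.
q_values = {0: -3.0, 1: -1.0, 2: -2.0}
early = epsilon_at(0)
late = epsilon_at(1000)
action = select_action(q_values, epsilon=0.0)
```

With a schedule of this kind the agent acts almost randomly while the Q table still holds its random initial values and becomes increasingly greedy as learning proceeds.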
4.2. Simulation Experiments
The simulation experiment is composed of two parts: a linear simulation experiment and a circular simulation experiment. In the simulation experiments, the start and end positions of the robot are in the same position, ensuring that the robot’s environment is the same in each experiment.
In the incremental discretization Double Q-learning algorithm, we set the remaining hyperparameters: the learning rate ; discount factor . The experience replay mechanism also needs to be initialized, we set the initial buffer to , and the total is ; we use in N-memories. The incremental discretization levels for the state space are , , . , where , and .
To test the feasibility and effectiveness of the algorithm, we conducted simulation experiments using the TurtleBot3 in Gazebo, as shown in
Figure 7. The robot input has two control variables
, where
m/s,
rad/s. In addition, the robot localizes itself by filtering a combination of odometry and SLAM [
48]. The agent state at time
is defined as
, and the initialization of the action state space
is randomly selected.
4.2.1. Linear Trajectory Tracking Simulation Experiment
The linear trajectory tracking simulation experiment defines the linear trajectory as shown below in Equation (13).
The initial pose of the robot is
. In the linear simulation, the trajectory tracking target speed is set to
m/s, with an expected angular velocity of
rad/s. The initial gains for the Backstepping trajectory tracking controller were chosen to be [0.5, 3.51, 2.5]. The self-adaptive trajectory tracking algorithm based on reinforcement learning is illustrated in
Figure 8.
The simulation results of linear trajectory tracking are shown in
Figure 8, which comprises six sub-figures.
Figure 8a shows the simulated position of linear trajectory tracking. This demonstrates that the proposed adaptive trajectory tracking algorithm is effective, allowing the robot to closely follow the expected trajectory.
Figure 8b presents the pose error for the linear trajectory tracking, where the system gradually converges to 0 after about 15 s.
Figure 8c illustrates the linear velocity and angular velocity over time for linear trajectory tracking, which stabilize at about 7 s and 8 s, respectively.
Figure 8d shows a diagram of the parameter gain effect for the linear trajectory tracking controller.
Figure 8e presents a simulation of incremental discretization level for linear trajectory tracking, which reaches the set maximum value over time. Finally,
Figure 8f shows the reward value for the linear trajectory tracking that stabilizes around −1.
4.2.2. Circular Trajectory Tracking Simulation Experiment
The circular trajectory tracking simulation experiment defines the circular trajectory as shown below in Equation (14).
Similar to the linear simulation, the robot’s initial pose is set as
for the circular simulation. The trajectory tracking target velocity is
m/s, with an expected angular velocity of
rad/s. The initial gain of the Backstepping trajectory tracking controller is chosen as [0.5, 3.51, 2.5]. The adaptive trajectory tracking algorithm based on reinforcement learning is illustrated in
Figure 9.
Figure 9 presents the simulation results for circular trajectory tracking.
Figure 9a shows the simulated position for circular trajectory tracking, where the trajectory tracking adaptive algorithm proposed in this paper works effectively and allows the robot to closely follow the expected trajectory.
Figure 9b displays the pose error for circular trajectory tracking, showing that the system gradually converges to 0 after about 9 s.
Figure 9c illustrates the linear velocity and angular velocity over time for circular trajectory tracking, which stabilize at around 6 s and 5 s, respectively.
Figure 9d shows a diagram of the parameter gain effect for the circular trajectory tracking controller, while
Figure 9e presents the simulation of incremental discretization level for circular trajectory tracking, which reaches the set maximum value over time. Finally,
Figure 9f shows the reward effect diagram for the circular trajectory tracking simulation that stabilizes around −1.
As shown in
Figure 8 and
Figure 9, the results demonstrate that the designed adaptive trajectory tracking controller can enable the mobile robot system to quickly eliminate disturbance errors and achieve accurate trajectory tracking with high precision.
4.3. Physical Experiments
This subsection validates the proposed self-adaptive trajectory tracking algorithm on an actual mobile robot. The physical experiment was conducted using the TurtleBot3 Waffle differential mobile robot platform, as shown in
Figure 10.
Before beginning the physical experiment, we set the hyperparameters of the RL algorithm and the robot trajectory tracking control system parameters to match those used in the Gazebo environment. The experiment was designed to verify the algorithm’s robustness, adaptability, and anti-disturbance capabilities for the indoor inspection project for mobile robots. In the physical experiment, we verified linear and circular trajectory tracking with a linear velocity of 0.2 m/s.
4.3.1. Physical Experiment with Linear Trajectory Tracking
In the linear trajectory tracking experiment, the initial position of the robot is set to
, with an expected velocity of
m/s and an expected angular velocity of
rad/s. The initial gain for the Backstepping trajectory tracking controller was set to [0.5, 3.51, 2.5]. To assess the adaptiveness and anti-disturbance performance of the trajectory tracking controller, random human disturbances were applied during the experiment. They were used to verify that, even after events such as communication failure, localization sensor deviation, or wheel slippage, the robot can correct back to the desired trajectory and track it accurately. The linear experiment results based on the double Q-learning adaptive trajectory tracking control algorithm are shown in
Figure 11.
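To make the role of the three initial gains concrete, the following is a minimal sketch of one commonly used backstepping tracking law for a differential-drive robot. It is not necessarily the exact law used in this paper: the function names, the mapping of [0.5, 3.51, 2.5] to the terms k1, k2, k3, and the error convention (reference minus actual, expressed in the robot frame) are all assumptions made for illustration.

```python
import math


def pose_error(ref, pose):
    """Pose error (reference minus actual) expressed in the robot frame.

    ref and pose are (x, y, theta) tuples. The convention is an assumption.
    """
    xr, yr, thr = ref
    x, y, th = pose
    dx, dy = xr - x, yr - y
    ex = math.cos(th) * dx + math.sin(th) * dy
    ey = -math.sin(th) * dx + math.cos(th) * dy
    # Wrap the heading error into (-pi, pi].
    eth = math.atan2(math.sin(thr - th), math.cos(thr - th))
    return ex, ey, eth


def backstepping_control(err, v_ref, w_ref, gains=(0.5, 3.51, 2.5)):
    """One common backstepping law (assumed form, not taken from the paper).

    v = v_ref * cos(e_theta) + k1 * e_x
    w = w_ref + k2 * v_ref * e_y + k3 * sin(e_theta)
    """
    k1, k2, k3 = gains  # assumed ordering of the initial gains
    ex, ey, eth = err
    v = v_ref * math.cos(eth) + k1 * ex
    w = w_ref + k2 * v_ref * ey + k3 * math.sin(eth)
    return v, w


# With zero tracking error the command reduces to the reference velocities.
v_cmd, w_cmd = backstepping_control((0.0, 0.0, 0.0), v_ref=0.2, w_ref=0.1)
```

In an adaptive scheme such as the one described here, the learning agent would adjust the `gains` argument online rather than leaving it fixed at the initial values.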
To verify the stability, robustness, and anti-interference properties of the adaptive trajectory tracking algorithm, an expected linear velocity of
m/s was set for the mobile robot. As seen in
Figure 11a, when the system experiences disturbances, the robot can quickly and accurately track the desired trajectory.
Figure 11b displays the pose error of the robot while tracking the desired trajectory. The tracking error initially has a large overshoot, but gradually becomes smaller after learning. When the system encounters external disturbances, the error can also rapidly be eliminated, making the system gradually stable.
Figure 11c reveals that the linear and angular velocities begin to converge at around 5 s and 4 s, respectively. When the system is disturbed, the robot's linear velocity and angular velocity adjust swiftly.
Figure 11d presents the evolution of the trajectory tracking controller parameter gains over time.
Figure 11e illustrates how the system focuses on improving its optimal policies by performing fine discretization operations on the state space and action space through incremental active learning mechanisms.
Figure 11f shows how the reward changes over time. After the disturbance is eliminated, the reward value finally stabilizes at around −1, demonstrating that the proposed adaptive trajectory tracking controller can achieve accurate trajectory tracking with good stability, robustness, and anti-interference performance for the mobile robot system.
4.3.2. Physical Experiment with Circular Trajectory Tracking
In the circular trajectory tracking experiment, the initial pose of the robot was set to
. The expected velocity and angular velocity were set to
m/s and
rad/s, and the initial gain for the Backstepping trajectory tracking controller was set at [0.5, 3.51, 2.5]. Similar to linear trajectory tracking, random perturbations were added to test the superior performance of the proposed algorithm. The results of the circular experiment based on the Double Q-learning adaptive trajectory tracking control algorithm are illustrated in
Figure 12.
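For the velocities reported for this experiment (0.2 m/s and 0.1 rad/s), the circular reference has radius v/ω = 2 m and a period of 2π/ω ≈ 62.8 s. The sketch below generates such a reference pose by integrating the unicycle model in closed form. The function name and the default start pose (0, 0, 0) are placeholders for illustration, not the initial pose used in the experiment.

```python
import math


def circular_reference(t, v=0.2, w=0.1, x0=0.0, y0=0.0, th0=0.0):
    """Reference pose at time t for constant linear velocity v and angular velocity w.

    Closed-form integration of x' = v*cos(th), y' = v*sin(th), th' = w.
    The start pose (x0, y0, th0) is a placeholder.
    """
    r = v / w  # turning radius
    th = th0 + w * t
    x = x0 + r * (math.sin(th) - math.sin(th0))
    y = y0 - r * (math.cos(th) - math.cos(th0))
    return x, y, th


radius = 0.2 / 0.1            # 2 m
period = 2 * math.pi / 0.1    # about 62.8 s for one full lap

# Half a lap later the robot is diametrically opposite the start point.
x_half, y_half, th_half = circular_reference(period / 2)
```

Sampling `circular_reference` at the controller rate gives the sequence of desired poses that a tracking controller would be asked to follow.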
To evaluate the stability, robustness, and anti-disturbance capabilities of the adaptive trajectory tracking algorithm, we conducted circular trajectory tracking experiments with an expected linear velocity of 0.2 m/s and angular velocity of 0.1 rad/s for the mobile robot.
Figure 12a shows that a system disturbance causes the robot to deviate from the desired trajectory, but the adaptive trajectory tracking algorithm quickly and accurately returns it to the desired trajectory. In addition,
Figure 12b displays the pose error of the robot during the tracking process, indicating that the tracking error exhibits a large overshoot in the initial stages, but the system gradually stabilizes at about 12 s. When the system experiences external disturbances, the adaptive trajectory tracking controller can rapidly eliminate the error, making the system tend towards a stable state.
Figure 12c illustrates that the linear velocity and angular velocity begin to converge around 8 s and 5 s, respectively. Furthermore, when a disturbance occurs, the system quickly adjusts the robot’s velocity and angular velocity.
Figure 12d indicates that the optimal trajectory tracking controller parameter gain is obtained after approximately 36 s.
Figure 12e shows that as learning progresses, the agent becomes more focused on improving its optimal policy. Through the incremental active learning mechanism, the agent performs fine discretization operations on the state space and action space. Finally,
Figure 12f presents the reward values at each moment during the experiments. After eliminating the disturbance, the final reward value is stable at around −1.
As shown in
Figure 11 and
Figure 12, the results demonstrate that the designed adaptive trajectory tracking controller can enable the mobile robot system to quickly eliminate disturbance errors and achieve accurate trajectory tracking with high precision.
4.4. Comparative Experiments
The traditional backstepping trajectory tracking control algorithm relies on expert-level prior experience to tune the controller parameter gains. We therefore compare three controllers: the Fuzzy-Backstepping adaptive trajectory tracking controller [
49] used on the laboratory robot platform, the advanced Backstepping-Fractional-Order PID controller [
50], and the Double Q-learning adaptive trajectory tracking controller proposed in this paper. The gains of the trajectory tracking controller selected by the laboratory platform are [5, 5, 3],
,
,
,
,
,
. The gains of the Backstepping-Fractional-Order PID trajectory tracking controller are
,
,
. A square was chosen as the desired trajectory for the physical test, and the initial pose of the robot was set to
. The expected velocity and angular velocity were set to
m/s. The trajectory comparison result is shown in
Figure 13a, and the error result is shown in
Figure 13b.
Comparing the experimental results, the average trajectory tracking errors of the three controllers are 0.05 m, 0.08 m, and 0.1 m. The algorithm proposed in this paper attains the smallest of these under the same environment and robot state. In other words, it demonstrates higher control accuracy than the two baseline controllers under similar conditions in the physical experiment.
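The text does not state how the average tracking error is computed. One plausible definition, sketched below under that assumption, is the mean Euclidean distance between the desired and the measured positions at matching time stamps. The function name and the sample points are invented for illustration and are not experimental data.

```python
import math


def mean_tracking_error(reference, actual):
    """Mean Euclidean distance between paired (x, y) positions, in metres."""
    if not reference or len(reference) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(
        math.hypot(xr - x, yr - y)
        for (xr, yr), (x, y) in zip(reference, actual)
    )
    return total / len(reference)


# Made-up samples: a straight reference and a track that is 0.1 m off at each point.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
actual = [(0.0, 0.1), (1.0, -0.1), (2.0, 0.1)]
error = mean_tracking_error(reference, actual)
```

Applying the same metric to each controller's logged trajectory yields directly comparable figures, provided the runs share the same reference and sampling instants.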
5. Conclusions
To address the difficulty of adjusting the parameter gains of the backstepping trajectory tracking controller for mobile robots, this paper proposes an adaptive trajectory tracking control method based on reinforcement learning. The method builds on the characteristics of the backstepping trajectory tracking controller, the double Q-learning algorithm, and the kinematic model of mobile robots. The improved online learning algorithm adopts an incremental discrete sub-region fast learning strategy so that the Q-table converges quickly and is progressively refined, improving the control accuracy of trajectory tracking. By comparing multiple time memories and experience replay mechanisms, we accelerate both the control learning system and the learning process. This shortens the learning time, so that the optimized algorithm completes learning faster in practical applications and realizes optimal trajectory tracking control for mobile robots. Finally, the proposed algorithm is verified through both simulation and physical experiments. The experimental results show that the control algorithm can adjust the parameters of the mobile robot trajectory tracking controller online in real time and has good robustness, generalization, real-time performance, and anti-disturbance capability under complex tasks. In addition, we believe that the control algorithm proposed in this paper has broad applicability and can be readily adapted to diverse control systems, such as UAVs and robotic arms.
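For readers unfamiliar with double Q-learning, the following is a minimal sketch of the generic tabular update, in which one table selects the greedy action and the other evaluates it. It does not include the incremental discretization or experience replay described in this paper. The function name, the state and action encoding, and the values of `alpha` and `gamma` are hypothetical.

```python
import random
from collections import defaultdict


def double_q_update(qa, qb, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.9, rng=random):
    """One generic tabular double Q-learning step (hypothetical hyperparameters).

    A coin flip decides which table is updated; that table selects the greedy
    next action, and the other table supplies the value estimate for it.
    """
    if rng.random() < 0.5:
        update, evaluate = qa, qb
    else:
        update, evaluate = qb, qa
    best = max(actions, key=lambda act: update[(s_next, act)])
    target = r + gamma * evaluate[(s_next, best)]
    update[(s, a)] += alpha * (target - update[(s, a)])


# Two Q-tables keyed by (state, action); unseen entries default to 0.0.
qa = defaultdict(float)
qb = defaultdict(float)
double_q_update(qa, qb, s=0, a=0, r=-1.0, s_next=1, actions=[0, 1])
```

Decoupling action selection from action evaluation in this way is what reduces the overestimation bias of ordinary Q-learning.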
In future work, we will focus on some interesting topics such as global path planning for mobile robots without SLAM, long-range autonomous navigation, and autonomous obstacle avoidance.