Highly Self-Adaptive Path-Planning Method for Unmanned Ground Vehicle Based on Transformer Encoder Feature Extraction and Incremental Reinforcement Learning

Abstract: Path planning is an indispensable component in guiding unmanned ground vehicles (UGVs) from their initial positions to designated destinations, aiming to determine trajectories that are either optimal or near-optimal. While conventional path-planning techniques have been employed for this purpose, planners utilizing reinforcement learning (RL) exhibit superior adaptability within exceedingly complex and dynamic environments. Nevertheless, existing RL-based path planners encounter several shortcomings, notably, redundant map representations, inadequate feature extraction


Introduction
Unmanned ground vehicles (UGVs) have emerged as a significant technological advancement with far-reaching applications across various fields. Their ability to operate without human presence in potentially hazardous or challenging environments makes them invaluable assets [1-3]. In the military domain, UGVs contribute to safeguarding personnel by undertaking tasks like bomb disposal and reconnaissance in hostile territories. Additionally, their deployment in risky industrial settings, such as hazardous material handling or mining operations, minimizes human exposure to potential dangers. Furthermore, UGVs play an increasingly crucial role in civilian sectors, including agriculture, where they automate tasks like crop monitoring and precision spraying, leading to enhanced efficiency and resource optimization. Therefore, the growing sophistication and versatility of UGVs underscore their importance as transformative tools for improving safety, efficiency, and productivity across diverse domains [4].
The success of UGVs in achieving their designated tasks hinges critically on the efficacy of path-planning algorithms, which play a pivotal role in determining the optimal or near-optimal trajectory for a UGV to navigate from its starting point to its destination [5,6]. By factoring in environmental constraints, obstacles, and potential hazards, path-planning algorithms enable UGVs to operate safely and efficiently. In dynamic environments, path-planning algorithms become even more crucial, as they must adapt to unforeseen changes in real time, ensuring the UGV avoids collisions and navigates safely. The continued development and refinement of path-planning algorithms are therefore paramount for ensuring the safe, reliable, and successful operation of UGVs across various applications [7].
Path-planning algorithms for UGVs can be mainly divided into traditional classical algorithms [8,9] and learning-based methods [10-12]. Traditional classical algorithms generally obtain the optimized path in complex environments through a hierarchical architecture incorporating a global planner and a local planner [13]. The global planner generates the global path from the starting point to the target point in the entire environment, considering the map, obstacles, and other environmental information at a global scale. Then, the local planner uses the global path as a reference and adjusts the path to adapt to dynamic obstacles or other unexpected situations. Classic global path planners include the Dijkstra algorithm [14], Prim algorithm [15], visibility graph algorithm [16], probabilistic roadmap algorithm [17], etc., while local path planners include the dynamic window method [18] and timed elastic band algorithm [19], etc. Although traditional path planners have a good theoretical foundation and application robustness, they cannot demonstrate good adaptability in extremely complex and dynamic environments due to high-dimensional, nonlinear, and complicated constraints. In order to solve the above problems, learning-based path-planning algorithms [20], especially reinforcement learning (RL)-based methods [21], have received increasing attention in recent years.
Path planners based on RL can improve their strategies through continuous interaction with the environment, thereby obtaining an optimized policy. This kind of method can effectively handle high-dimensional state spaces and is suitable for application in dynamic environments that require the consideration of complex perceptual information, surpassing traditional classical algorithms by a large margin. RL-based path planners generally first perform a mathematical representation of the environment, and then use feature-extraction technology to extract features that are important to path planning from information such as the environmental representation or the status of the UGV to form a state vector. Finally, the planner establishes a neural network to implement end-to-end mapping from the state vector to the control instructions. However, existing RL-based path planners are confronted with three drawbacks in terms of map representation, feature extraction, and adaptive capability.
Regarding map representation, prevalent methods employed by existing path planners entail utilizing either a global occupancy map or the direct input of the original camera or 3D lidar points from the immediate environment. For instance, Chen et al. employed a convolutional neural network (CNN) trained on an egocentric local occupancy map to predict optimal steering actions for a robotic system, demonstrating the feasibility of deploying a map-based end-to-end navigation model onto real-world robotic platforms [22]. Similarly, Wang et al. introduced an off-road path-planning approach based on deep RL, training the agent within a low-dimensional simulator constructed using occupancy maps [23]. Furthermore, Fan et al. proposed a hierarchical RL-based path planner for exploring unknown spaces, utilizing lidar readings alongside iteratively generated occupancy maps as observations for the RL agent [24]. However, while employing original occupancy maps or sensor observation data as inputs for UGV path planning facilitates the preservation of environmental information to a large extent, it also presents challenges. The high dimensionality of such data, particularly in the case of high-resolution maps or dense sensor data, can markedly augment the computational complexity of subsequent path-planning algorithms. Consequently, this may compromise the efficiency and real-time performance of the planning system, which are crucial for ensuring safe and dependable autonomous navigation within dynamic environments.
In terms of feature extraction, the majority of existing RL-based path-planning methodologies employ convolutional neural networks (CNNs) or fully connected networks (FCNs) as foundational architectures for information extraction [25,26]. This preference is rooted in the remarkable efficacy of CNNs in image processing tasks, rendering them well suited for handling image-like map representations, while FCNs are favored for their versatility and simplicity. For instance, Sartori et al. leveraged a CNN to extract salient features from a top-down environmental image, encompassing obstacle distributions as well as the locations of starting and goal points [27]. Similarly, Jin et al. introduced a pyramid path-planning network, amalgamating a CNN with a feature-pooling pyramid structure to extract multiscale features from various hierarchical levels, thereby generating a local feature representation enriched with semantic information [28]. Additionally, Qureshi et al. employed an FCN to amalgamate environmental features such as raw point clouds obtained from depth sensors, alongside a robot's initial and target configurations [29]. However, notwithstanding their proficiency in local feature extraction, CNNs and FCNs encounter challenges in capturing long-range dependencies, a pivotal aspect for effective path planning.
As for adaptive capability, the prevailing paradigm among RL-based path planners is predicated upon the assumption that the deployed agent will encounter environments possessing similarities to those present during training phases. For instance, Bae et al. conducted training and testing of a deep Q-learning RL agent on congruent grid-like occupancy maps [30]. Similarly, Wang et al. introduced a pioneering path-planning methodology for multi-agent systems, integrating flocking control and RL, wherein both training and testing simulations were conducted within the same environments developed utilizing Visual Studio Community software and the Unity3D engine [31]. Furthermore, Huang et al. put forward an RL-based path planner for a continuous dynamic simulation environment, where the obstacles were all circle-like in the training and testing environments [32]. While this presumption of congruent obstacle features and distributions across training and testing settings facilitates, to some extent, the applicability of RL-based path planners, such scenarios remain uncommon in real-world applications. It is more typical for application scenarios to exhibit significant disparities from training environments, thereby diminishing the performance of trained RL agents when faced with unseen environments. Consequently, there is a pressing need for a highly self-adaptive path-planning framework to augment the generalization capabilities of trained RL agents.
In order to overcome the above three drawbacks, this paper proposes a novel RL-based path planner with highly self-adaptive capability, combining a Transformer encoder block and incremental reinforcement learning (IRL) and using a compressed map representation as the input. The contributions of this study include the following three aspects.

• Firstly, the original 2D map is compressed to a 1D feature vector using an autoencoder to lighten the computational burden of the subsequent RL path planner. The compressed 1D feature vector can achieve a highly accurate reconstruction of the original 2D map, thus ensuring that ample information is retained while the input dimension is greatly reduced.
• Secondly, the Transformer encoder block, which has global long-range dependency analysis capability, is adopted to capture the highly intertwined correlation between UGV statuses at consecutive instants. The results show that the Transformer encoder demonstrates better optimality than a traditional CNN or FCN thanks to its strong feature-extraction capability.
• Thirdly, incremental reinforcement learning (IRL) is adopted to improve the path planner's generalization ability when the trained agent is deployed in environments totally different from the training environments. The results show that IRL can achieve 5× faster adaptation than traditional transfer-learning-based fine-tuning methods.
The remainder of this paper is organized as follows. Section 2 details the general framework of the proposed path planner, together with the relevant theoretical foundations involved. Section 3 validates the proposed method in various simulation environments and compares it with uniform-sampling-based planners and transfer-learning-based fine-tuning methods. Section 4 concludes the paper and points out possible future research directions.

Methodology
The general framework of the proposed path-planning method is shown in Figure 1. Firstly, the RL agent prepares input information for the path-planning decision. The inputs include three aspects, namely the compressed map representation, the target point, and the UGV's status. The compressed map representation is a latent representation of the original 2D map given by the autoencoder, which will be discussed in detail in Section 2.1. Because the first two items are invariant during the whole path-planning process, they are directly fed to the IRL module. The UGV status, including its x-coordinate, y-coordinate, and orientation, will change as the UGV moves; therefore, status information from a past time window with a length of n will be collected and fed into the Transformer encoder. Then, four consecutive Transformer encoder layers will process the UGV status information and fully exploit the temporal correlation of the UGV's movements. The Transformer encoder layer will be introduced in Section 2.2. Finally, the output of the Transformer, together with the two direct inputs, will be processed by the IRL module, where incremental learning will be incorporated to improve the agent's generalization ability in different environments. The process of IRL will be elaborated upon in Section 2.3.
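The status window described above can be sketched as a fixed-length buffer. This is only an illustrative sketch: the class name `StatusWindow`, the choice to pre-fill the window with the initial status, and the example values are assumptions and are not taken from the paper.

```python
from collections import deque


class StatusWindow:
    """Keeps the n most recent UGV statuses (x, y, orientation)."""

    def __init__(self, n, initial_status):
        # Assumption: pre-fill with the initial status so the window
        # always has length n, even at the first planning step.
        self.buffer = deque([initial_status] * n, maxlen=n)

    def push(self, status):
        # Appending to a full deque drops the oldest entry automatically.
        self.buffer.append(status)

    def as_sequence(self):
        # Oldest first, newest last: the order passed to the encoder.
        return list(self.buffer)


# Hypothetical usage: the UGV moves 0.4 m along x at each step.
window = StatusWindow(n=4, initial_status=(0.0, 0.0, 0.0))
for step in range(1, 7):
    window.push((0.4 * step, 0.0, 0.0))
sequence = window.as_sequence()
```

Each element of `sequence` would then act as one token of the Transformer encoder input, while the compressed map and target point bypass the encoder.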
It is imperative to underscore that classical-sampling-based planners, such as rapidly exploring random trees (RRT), RRT*, or bidirectional-RRT, typically rely on generating samples uniformly distributed across a designated state space. However, these planners commonly confine the UGV within a limited portion of the state space. Consequently, the uniform sampling strategy leads to the exploration of numerous states that have a negligible influence on the final path. This inefficiency significantly hampers the planning process, particularly within state spaces characterized by high dimensionality and environments featuring narrow passages. To address this challenge, the proposed method endeavors to reduce the sampling space from the entirety of the state space to an optimal subset, guided by the path-planning outcomes derived from RRT or RRT*. This strategic adjustment notably enhances the efficiency of the path-planning process. For further elaboration, please refer to Section 3.

Autoencoder for Environment Representation
An autoencoder is a type of artificial neural network trained to compress and reconstruct its input data [33]. In simpler terms, it aims to learn a compressed representation of the input while still being able to accurately recreate the original data from this compressed version. An autoencoder consists of two main parts, namely an encoder and a decoder. The encoder part takes the input data and compresses it into a lower-dimensional representation, often called the latent space. This compressed version captures the essential features of the input. The decoder part receives the latent space representation and tries to reconstruct the original input data from it. The general framework of the autoencoder is shown in Figure 2.
Let us denote the encoder network and the decoder network as h = f(x) and y = g(h), where x, h, and y represent the input, latent representation, and output, respectively. The encoder and decoder networks can take arbitrary forms, including the FCN, CNN, recurrent neural network (RNN), etc. [34]. Because we want to compress the two-dimensional environment map into the latent space, the CNN is selected as the encoder network considering its superior capability in processing image-like two-dimensional inputs.
Within a convolutional layer, a learnable filter, usually referred to as a kernel, is applied to the map, followed by processing with an activation function. The mathematical formula of the CNN layer can be expressed as [35]
$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\right) \tag{1}$$
where $x_j^{l}$ represents the jth feature map of the lth convolutional layer; $M_j$ represents the collection of input feature maps; $k_{ij}^{l}$ is the kernel; $f$ is the activation function; * denotes the convolutional operation; and $b$ represents the additional bias added to the output feature map.
After being processed by the CNN layer, the obtained feature map is usually smaller than the original input, mainly due to the stride and convolutional operation. Because the autoencoder tries to reconstruct the original input, the decoder network needs to upsample the feature map output by the encoder network to the same size as the original input. Therefore, transposed convolution, also known as fractionally strided convolution, is introduced into the decoder network. It is essentially the reverse operation of convolution and allows the network to learn to upsample the information and generate a larger image. It achieves this by introducing learnable filters and performing similar element-wise multiplication and summation. Details of the mathematical operation of transposed convolution can be found in Ref. [36].

Transformer Encoder Layer
The Transformer encoder, a core component in many deep learning architectures, utilizes a stacked structure of identical layers. Each layer is composed of two sub-layers: a multi-head attention mechanism and a fully connected feed-forward network [37]. The multi-head attention allows the model to attend to relevant parts of the input sequence, while the feed-forward network introduces non-linearity for complex feature extraction [38]. Residual connections and layer normalization are implemented around each sub-layer to address vanishing gradients and accelerate training, respectively. This design enables the encoder to effectively capture long-range dependencies within the input data.
At the heart of the Transformer encoder lies the multi-head attention mechanism, as shown in Figure 3. This mechanism splits the model's representation into multiple heads, each acting as a subspace that allows the model to attend to different aspects of the input information. The outputs from these heads are then concatenated, enabling the network to capture a richer and more nuanced understanding of the features within the data. Here, scaled dot-product attention is adopted, and the computational formulas are shown in Equations (2)-(5).
$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V} \tag{2}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{3}$$
$$H_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right), \quad i = 1, \ldots, h \tag{4}$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, \ldots, H_h)W^{O} \tag{5}$$
Here, the query matrix $Q$, key matrix $K$, and value matrix $V$ are generated by transforming the feature vector matrix $X$; $W^{Q}$, $W^{K}$, and $W^{V}$ are all linear transformation matrices; $d$ is the scaling factor; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the transformation matrices projecting $Q$, $K$, and $V$ into the ith subspace, where i ranges from 1 to h, and h is the total number of subspaces; $H_i$ represents the single-head attention values in the ith subspace; and $W^{O}$ is the transformation matrix applied to the concatenated attention values from all subspaces.
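To make the scaled dot-product attention concrete, the following is a minimal single-head sketch in plain Python. It assumes identity projection matrices (so Q = K = V = X) and a toy three-step status sequence; neither assumption comes from the paper, and a practical implementation would use learned projections and several heads.

```python
import math


def softmax(row):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    d = len(K[0])  # scaling factor: dimension of each key vector
    weights = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value rows.
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Hypothetical UGV statuses (x, y, orientation) at three consecutive instants.
X = [[0.0, 0.0, 0.0], [0.4, 0.0, 0.0], [0.8, 0.1, 0.1]]
out, attn = scaled_dot_product_attention(X, X, X)
```

Every row of `attn` sums to one, and each time step can draw on every other step in the window regardless of how far apart they are, which is the long-range property the text refers to.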


Incremental Reinforcement Learning
This paper proposes an RL path-planning model based on incremental learning, namely the incremental collaborative learning knowledge model (ICLKM) [39]. For the RL part, we adopt the advanced soft actor-critic (SAC) algorithm.
The SAC algorithm distinguishes itself from other deep RL approaches by incorporating the concept of entropy, a measure of randomness in probability distributions [40]. In the context of SAC, entropy reflects the level of stochasticity, unpredictability, or variation within an agent's actions. Higher entropy values signify increased action randomness, resulting in richer exploration of potential actions. This entropy injection encourages policy exploration within the state space, mitigating the risk of becoming trapped in local optima. By enabling the exploration of diverse solution pathways, entropy ultimately enhances the robustness of the final learned policy. The specific mathematical formula for calculating entropy is
$$H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right] \tag{6}$$
where $P$ is the probability distribution of the random variable $x$.
In the realm of general RL algorithms, the objective centers on acquiring a strategy that maximizes the total accumulated reward received by the agent throughout its interactions with the environment [41]. In simpler terms, the goal is to learn a course of action that yields the greatest overall benefit for the agent, namely
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[\sum_{t} r(s_t, a_t)\right] \tag{7}$$
where $s_t$ and $a_t$ are the state and action at time t, $r$ is the reward, and $\rho_{\pi}$ is the state-action distribution induced by the policy $\pi$.
For the SAC algorithm, in addition to the above general objective, it is also required that the strategy has the maximum entropy at each output action:
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[\sum_{t} r(s_t, a_t) + \alpha_{H} H\left(\pi(\cdot \mid s_t)\right)\right] \tag{8}$$
where the temperature coefficient, written here as $\alpha_{H}$, weights the entropy term against the reward. Its purpose is to ensure the randomness of the strategy, making each output action as dispersed as possible and improving the exploration ability of the agent. The core idea of maximum entropy in motion planning tasks is to sample as many useful trajectories as possible.
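As a small numerical illustration of the entropy measure discussed above, the sketch below evaluates it for two discrete action distributions. This is an assumption made for illustration only: SAC in this paper acts in a continuous action space, where the sum becomes an expectation over a density, and the four-action probabilities are made up.

```python
import math


def entropy(probs):
    # Discrete entropy H(P) = -sum_x P(x) log P(x); zero-probability
    # outcomes contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical policies over four discrete actions.
uniform = [0.25, 0.25, 0.25, 0.25]   # fully exploratory
peaked = [0.97, 0.01, 0.01, 0.01]    # almost deterministic

h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
```

The uniform policy attains the largest possible entropy for four actions, log 4, while the peaked policy scores much lower. Adding an entropy bonus to the objective therefore penalizes policies that collapse onto a single action too early.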
In the SAC algorithm, the state value function $V^{\pi}_{\mathrm{soft}}(s)$ can be expressed as
$$V^{\pi}_{\mathrm{soft}}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q^{\pi}_{\mathrm{soft}}(s_t, a_t) - \alpha_{H} \log \pi(a_t \mid s_t)\right] \tag{9}$$
The action value function $Q^{\pi}_{\mathrm{soft}}(s, a)$ can be expressed as
$$Q^{\pi}_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V^{\pi}_{\mathrm{soft}}(s_{t+1})\right] \tag{10}$$
where $\gamma$ is the discount factor and $\alpha_{H}$ denotes the temperature coefficient that weights the entropy term.
The SAC algorithm employs a unique network architecture consisting of one actor network and four critic networks. Two of the critics estimate the state-value function: the network $V(s)$ and its target network $V_{\mathrm{target}}(s)$, which is used for stabilization during training. The remaining two critics focus on the action value function $Q(s, a)$. Notably, the actor network and both Q-networks are synchronously updated using their respective parameters, while the target network is held fixed for a period of time before being synchronized with the latest parameters of the $V(s)$ network. During training, the experience replay pool $D$ provides samples, and exploration noise $\varepsilon$ drawn from a standard normal distribution is added to the actions. The loss function for $Q(s, a)$ is then defined as
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim D}\left[\frac{1}{2}\left(Q_{\theta}(s_t, a_t) - \left(r(s_t, a_t) + \gamma V_{\mathrm{target}}(s_{t+1})\right)\right)^{2}\right] \tag{11}$$
The loss function of the actor network is
$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim D,\, \varepsilon_t \sim \mathcal{N}(0, I)}\left[\alpha_{H} \log \pi_{\phi}(a_t \mid s_t) - Q_{\theta}(s_t, a_t)\right] \tag{12}$$
The reparameterization technique is introduced in the process of updating the actor network, which means that
$$a_t = f_{\phi}(\varepsilon_t; s_t) = \mu_{\phi}(s_t) + \sigma_{\phi}(s_t) \odot \varepsilon_t \tag{13}$$
where $\mu_{\phi}$ and $\sigma_{\phi}$ are the mean and standard deviation output by the actor network. Considering practical applications, it is necessary to limit the output to a certain range, so a squashing tanh function is added to limit the final output to (−1, 1):
$$a_t = \tanh\left(\mu_{\phi}(s_t) + \sigma_{\phi}(s_t) \odot \varepsilon_t\right) \tag{14}$$
In the setting of incremental learning in path planning [42], the model first learns from the sample training set $D_0$ to obtain a high-performance model $M_0$. When new environmental samples appear, the already trained model is incrementally trained on the gradually emerging sample dataset $D_t$, and the entire model is continuously updated to obtain $M_t$. To equip SAC with incremental learning capability, the SAC agent adopts a dual network structure of $M_t'$ and $M_t$. When the model is learning from $D_t$, the first network $M_t'$ uses knowledge distillation to approximate the current learnt path-planning model $M_{t-1}$, helping the model retain learned knowledge and reduce the forgetting of old knowledge when learning new knowledge. Then, the second network $M_t$ learns new data, takes the output of the first network as the learning objective, and performs consistency loss on the outputs of the two networks at each training step, enabling the path-planning model to effectively learn new knowledge. Finally, the first network $M_t'$ and the second network $M_t$ learn together. $M_t'$ uses mean squared error (MSE) loss, consistency loss, and distillation loss to update the internal parameters through backpropagation. $M_t$ adopts a knowledge collaboration strategy to update the internal parameters to generate a more adaptive model for path planning.
In the proposed dual network structure, the first network model $M_t'$ adopts the knowledge distillation method [43]. The distillation loss $L_{kd}$ only relies on the previous model $M_{t-1}$. It considers the MSE loss calculated between the outputs of $M_{t-1}$ and $M_t'$. Distillation loss $L_{kd}$ is defined as
$$L_{kd} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{\tau}_i(x) - \tau_i(x)\right)^{2} \tag{15}$$
where n is the dimension of the outputs, and $\hat{\tau}_i(x)$ and $\tau_i(x)$ represent the outputs of $M_t'$ and $M_{t-1}$, respectively. When learning from $D_t$, the output of the second network model $M_t$ using the new dataset $D_t$ as the input will be adopted as the learning objective of the first network model $M_t'$. By continuously maintaining consistency between the two networks, the path-planning model can effectively learn new knowledge from new data.
Before training, both the first network model $M_t'$ and the second network model $M_t$ use the previous model $M_{t-1}$ as the pretrained model for initialization. The consistency loss $L_{con}$ of the two networks is
$$L_{con} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{\tau}_i(x) - \tau_i^{*}(x)\right)^{2} \tag{16}$$
where $\tau_i^{*}(x)$ represents the output of $M_t$. In each step, the first network model $M_t'$ is trained in a supervised manner by calculating the MSE loss $L_{mse}$:
$$L_{mse} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{\tau}_i(x) - \tau_i^{gt}(x)\right)^{2} \tag{17}$$
where $\tau_i^{gt}(x)$ represents the ground truth target point of the path-planning algorithm. Therefore, $M_t'$ updates its weights through backpropagation based on the MSE loss $L_{mse}$, the distillation loss $L_{kd}$, and the consistency loss $L_{con}$ between the two networks. The complete loss function $L_{all}$ for $M_t'$ is
$$L_{all} = L_{mse} + \lambda L_{kd} + \beta L_{con} \tag{18}$$
The terms of the loss function are balanced by the weights $\lambda$ and $\beta$. At the same time, the weights of the second network $M_t$ are excluded from backpropagation and are instead updated through a knowledge collaboration strategy to generate a more adaptive model, that is, the weights of $M_t$ are updated using the exponential moving average of the weights in $M_t'$:
$$\theta_{t}^{j} = \alpha\, \theta_{t}^{j-1} + (1 - \alpha)\, \theta_{t'}^{j} \tag{19}$$
where $\theta_{t}^{j}$ and $\theta_{t'}^{j}$ represent the weights of the networks $M_t$ and $M_t'$, respectively, and $\alpha$ is the smoothing coefficient hyperparameter. The parameters are updated at each training step j.
Both the learnt network models, $M_t'$ and $M_t$, can be used for path planning. However, compared to the first network model $M_t'$, the second network model $M_t$ uses a knowledge collaboration algorithm to make parameter updates more robust and reflect the learning state of the first network $M_t'$, resulting in better prediction performance. Therefore, the network model $M_t$ is used as the final model for path planning.
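The dual-network update described in this section can be sketched with plain lists standing in for network outputs and weights. Several details here are assumptions rather than facts from the paper: the pairing of λ with the distillation term and β with the consistency term, the exact form of the moving average, and all numeric values. The function names are hypothetical.

```python
def mse(a, b):
    # Mean squared error over the n output dimensions.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(out_first, out_prev, out_second, target, lam=1.0, beta=1.0):
    """Loss for the first network: supervised + distillation + consistency."""
    l_mse = mse(out_first, target)      # against the demonstration target
    l_kd = mse(out_first, out_prev)     # against the previous model's output
    l_con = mse(out_first, out_second)  # against the second network's output
    return l_mse + lam * l_kd + beta * l_con


def ema_update(theta_second, theta_first, alpha=0.99):
    # The second network is not trained by backpropagation; its weights
    # track an exponential moving average of the first network's weights.
    return [alpha * s + (1.0 - alpha) * f
            for s, f in zip(theta_second, theta_first)]


# Hypothetical two-dimensional outputs and weights.
loss = total_loss([1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0],
                  lam=0.5, beta=2.0)
theta = ema_update([0.0, 0.0], [1.0, 2.0], alpha=0.9)
```

With a smoothing coefficient close to one, the second network changes slowly and averages out step-to-step noise in the first network, which is consistent with the text's argument for using it as the final planner.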

Validation
In this section, we will first introduce the simulation setup information, including the hardware configurations and the training/testing datasets that were used to train and evaluate the agent. Then, the effectiveness of the compressed map representation will be validated, especially its reconstruction capability. Afterwards, a comparison between the proposed Transformer encoder and other commonly used networks will be conducted to highlight the superiority of using the Transformer to extract features. Finally, the proposed method will be compared with traditional uniform path-planning samplers, and its fast adaptability to different environments will be highlighted.

Simulation Setup
Simulations were conducted on a computer with an Intel i9-11980HK CPU and an Nvidia RTX 3080Ti graphics card. To ensure diversity in the training dataset, and thereby improve the generalization ability of the trained model, 200 maze map environments were generated. Each maze map measured 10 m × 10 m, and the resolution and wall thickness were both 0.4 m. For each generated map, 2000 demonstration paths were generated by the RRT* algorithm and fed to the path-planning network as training targets. Figure 4 shows four representative generated maps and the RRT* demonstration paths.

Validation of Environment Compression
The latent representation of the environment has a great impact on the final path planner performance, since it carries all of the map information on which the RL agent bases its decisions. If the compressed latent state allows the reconstructed map to be very close to the original map, it is regarded as retaining sufficient key information of the map.
We used the 200 generated maze maps as the training dataset. The size of the original input to the autoencoder was 25 × 25 (the width of the map was 10 m and the resolution was 0.4 m, so the size was 10 m/0.4 m = 25). For the encoder part, two consecutive CNN layers were used, each with a kernel size of 3 × 3, a padding of 1 on both sides, and a stride of 2. Each layer roughly halves the side length (25 → 13 → 7), so the latent representation of the environment was 7 × 7 after being processed by the CNN layers. Then, two transposed convolution layers with the same hyperparameters as the above CNN layers were used in the decoder network to resize the latent representation to the original size of the map. Because the original map was a binary occupancy map, i.e., each grid cell was 0 or 1, we also wanted the reconstructed map to have only two values. Therefore, a softmax layer was added to the decoder network, and values higher than 0.5 were then set to 1, while all others were set to 0. The MSE between the original and reconstructed maps at each pixel was defined as the loss to train the autoencoder.
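The sizes quoted above can be checked with the standard output-size formulas for convolution and transposed convolution. The sketch below is only a worked calculation, not the authors' network; the helper names `conv_out` and `deconv_out` are hypothetical, and the transposed layers are assumed to use no extra output padding.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Side length after a strided convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_out(n, kernel=3, stride=2, padding=1):
    """Side length after a transposed convolution (assuming no output padding)."""
    return (n - 1) * stride - 2 * padding + kernel


# 10 m map width at 0.4 m resolution gives 25 cells per side.
cells = round(10 / 0.4)

# Encoder: two strided convolutions.
encoded = [cells]
for _ in range(2):
    encoded.append(conv_out(encoded[-1]))

# Decoder: two transposed convolutions with the same hyperparameters.
decoded = [encoded[-1]]
for _ in range(2):
    decoded.append(deconv_out(decoded[-1]))

print(encoded)  # side lengths through the encoder
print(decoded)  # side lengths through the decoder
```

Under these assumptions the encoder maps 25 → 13 → 7 and the decoder maps 7 → 13 → 25, so the reconstructed map has the same size as the input without any cropping.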
Figure 5 shows some examples of original and reconstructed maps. The examples include both training maps and testing maps that were not used in the autoencoder training. The reconstructed maps are very close to the original maps in both the training and testing scenarios, indicating the effectiveness and generalization ability of the trained autoencoder. The MSEs between the original and reconstructed maps are 0.0144 and 0.0192 on the training and testing datasets, respectively. Therefore, the latent state representation generated by the autoencoder can provide sufficient information for the subsequent path planning.
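To give a feel for these numbers, the sketch below computes the per-pixel MSE between two binary maps. It assumes, as an illustration only, that the reported MSE is evaluated on the thresholded (binary) reconstruction; in that case the MSE equals the fraction of mismatched cells, so 0.0144 on a 25 × 25 map would correspond to 9 of the 625 cells. The checkerboard map and the helper names `binarize` and `mse` are hypothetical.

```python
def binarize(probabilities, threshold=0.5):
    """Set values above the threshold to 1 and all others to 0."""
    return [[1 if p > threshold else 0 for p in row] for row in probabilities]


def mse(map_a, map_b):
    """Mean squared error over all cells of two equally sized 2D maps."""
    diffs = [(a - b) ** 2 for ra, rb in zip(map_a, map_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


size = 25
# Hypothetical binary occupancy map (checkerboard), used only as test data.
original = [[(r + c) % 2 for c in range(size)] for r in range(size)]

# Copy the map and flip 9 cells to imitate reconstruction errors.
reconstructed = [row[:] for row in original]
for c in range(9):
    reconstructed[0][c] = 1 - reconstructed[0][c]

error = mse(original, reconstructed)
print(error)  # 9 mismatched cells out of 625
```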

Validation of Transformer Encoder Feature Extraction
According to Section 3.2, the environment was represented as a 7 × 7 matrix. Its flattened form, namely a vector of size 49, was concatenated with the target point configuration, a vector of size 3 (representing x, y, and orientation), and fed into the IRL module directly. For the Transformer encoder, a time window with a length of 10, collecting the corresponding UGV status information, was set as the input. Four Transformer encoder layers were used, each with an embedding size of 24, three attention heads, and a feed-forward hidden layer size of 48. The Transformer encoder exploited the correlations among the UGV statuses over the recent time window and passed the most essential information to the following IRL module.
The reward r used by the IRL module is defined as the negative of the loss l, where l is built from e x and e y , the distances between the predicted target point and the target point recommended by RRT* in the x and y directions, scaled by the coefficients α and β. The reward is the opposite of the loss because maximizing the reward is equivalent to minimizing the loss. Here, we set α = β = 100.
To verify the effectiveness of the Transformer encoder's feature-extraction ability, three representative network structures, namely CNN, GRU, and long short-term memory (LSTM) networks, were used as benchmarks. Two evaluation metrics, path length and smoothness, were used for comparison. Path length refers to the accumulated Euclidean distance between consecutive waypoints on the path. Smoothness is defined according to Ref. [44] and measures the total amount of curvature along a path; a lower smoothness value indicates a smoother path. Table 1 compares the results on the testing datasets. It can be seen from Table 1 that the proposed method using the Transformer encoder achieves a 9.6% to 16.3% improvement in path length and a 5.1% to 21.7% improvement in smoothness over these benchmarks.
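The two evaluation metrics can be sketched as follows. Path length follows the definition given above. The exact smoothness definition of Ref. [44] is not reproduced in this text, so the sum of absolute heading changes at the interior waypoints is used here as an assumed stand-in for "total amount of curvature"; the function names and the sample paths are hypothetical.

```python
import math


def path_length(waypoints):
    """Accumulated Euclidean distance between consecutive waypoints."""
    return sum(math.dist(p, q) for p, q in zip(waypoints, waypoints[1:]))


def total_turning(waypoints):
    """Sum of absolute heading changes (radians) at interior waypoints.

    Assumed proxy for smoothness; lower values mean a smoother path.
    """
    total = 0.0
    for a, b, c in zip(waypoints, waypoints[1:], waypoints[2:]):
        heading_in = math.atan2(b[1] - a[1], b[0] - a[0])
        heading_out = math.atan2(c[1] - b[1], c[0] - b[0])
        # Wrap the difference into [-pi, pi) before taking its magnitude.
        delta = (heading_out - heading_in + math.pi) % (2 * math.pi) - math.pi
        total += abs(delta)
    return total


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
slanted = [(0.0, 0.0), (3.0, 4.0), (3.0, 8.0)]

print(path_length(slanted))
print(total_turning(straight), total_turning(corner))
```

A straight path has zero accumulated turning, while a single right-angle corner contributes π/2, which matches the convention that a lower value indicates a smoother path.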

Validation of Effectiveness and Adaptability
Figure 6 shows the loss of the two networks, M t ′ and M t , during the training process. The losses of the two networks follow a similar decreasing trend, which indicates that the training is effective and that the target point given by the proposed method is close to that of RRT*. The loss curve of M t is also less noisy than that of M t ′ . This is because M t ′ is the main network that interacts with the environment and uses the reward signal to improve its performance, while M t uses the parameters of M t ′ as its training target. The M t ′ network therefore already provides sufficient beneficial experience to the M t network, sparing it trial-and-error costs and leading to lower fluctuations in its loss curve.
To demonstrate the effectiveness of the proposed method, Figure 7 shows the target points generated by the proposed method in two representative scenarios, together with those obtained when traditional uniform sampling is partially mixed in, where r represents the ratio of the points generated by the proposed method to the total number of points. It can be seen from Figure 7 that if all target points are generated by the proposed method, namely r = 100%, every point guides the UGV directly towards the final goal point, without redundant or meaningless points. However, if uniform sampling is partially incorporated, some noise points deviating from the optimal path appear, which decreases the path-planning efficiency and optimality.
A typical example of the paths generated by the traditional uniform sampling method and the proposed method is shown in Figure 8. The path of the proposed method is much shorter and smoother than that of the uniform sampling method. For a fairer comparison, Figure 9 compares the two methods over 30 trials to rule out chance effects. The statistical metrics, including execution time, path length, and smoothness, are reported. Figure 9 shows that the proposed method clearly outperforms the uniform sampling method in path length and smoothness. As for the execution time, although the mean values of the two methods are similar, the proposed method has a much lower deviation, indicating its robustness to different random settings.
The aforementioned simulation validates the efficacy of the proposed method under conditions resembling those of the training settings. However, real-world scenarios frequently diverge significantly from the datasets used for training, leading to a notable decline in the performance of trained agents when traditional fine-tuning-based transfer-learning techniques are employed. In contrast, the proposed method adopts an incremental learning framework, enabling rapid adaptation to novel environments by leveraging the abstracted experience gained from the training datasets. To substantiate this claim, a notably complex environment, more complex than those in the training datasets, is employed for verification, as depicted in Figure 10a. The corresponding training losses of the traditional fine-tuning method and the proposed method are presented in Figure 10b. Notably, the proposed method adapts roughly five times faster than the traditional transfer-learning approach. For instance, reaching a loss of approximately 20 requires around 44,500 iterations with transfer learning, whereas the proposed method reaches this level in only 9000 iterations. Furthermore, the path generated by the proposed method, as illustrated in Figure 10a, is smooth and has a reasonable length, further affirming its efficacy.
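The fivefold figure follows directly from the two iteration counts quoted above, as the short calculation below shows. The variable names are illustrative only.

```python
# Iterations needed to reach a loss of approximately 20 (values from the text).
iters_transfer_learning = 44_500
iters_proposed = 9_000

# Ratio of iteration counts, i.e. how many times faster the proposed method adapts.
speedup = iters_transfer_learning / iters_proposed
print(f"{speedup:.2f}x fewer iterations")
```

The ratio is about 4.94, which is rounded to five in the text.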

Conclusions
This paper introduces a novel, highly self-adaptive path-planning method based on Transformer encoder feature extraction and IRL. Our principal conclusions can be summarized in three points:

• The autoencoder generates a compressed representation of the original environment that supplies ample information for subsequent path planning while significantly reducing the computational overhead. The MSEs between the original and reconstructed maps are 0.0144 and 0.0192 on the training and testing datasets, respectively.

• Comparative evaluations show that the Transformer encoder has superior feature-extraction capability compared with commonly used networks such as the CNN, GRU, and LSTM. Specifically, the proposed method with the Transformer encoder yields a 9.6% to 16.3% improvement in path length and a 5.1% to 21.7% improvement in smoothness.

• The proposed method demonstrates better optimality than uniform-sampling-based approaches and better adaptability than traditional transfer-learning-based methods. Specifically, it achieves a 14.8% reduction in path length and a 61.1% improvement in smoothness compared to uniform-sampling-based approaches. Furthermore, by leveraging incremental learning, it adapts about five times faster than traditional transfer-learning approaches.

In future work, our primary objective will be to strengthen the credibility and applicability of the proposed method through rigorous real-world testing on a physical UGV. By moving from simulated environments to actual field tests, we intend to evaluate the method's real-time performance and robustness under diverse practical scenarios, as in [45]. Through careful data collection and analysis during these tests, we aim not only to validate the effectiveness of the proposed method but also to identify any potential limitations or areas for improvement.


Figure 1. General framework of the proposed path-planning method.

Figure 4. Generated maze maps and demonstration path.

Figure 5. Comparison between original maps and reconstructed maps. (a) Example on training map dataset; (b) example on testing map dataset.

Figure 6. Training loss trend for the two networks. (a) Training loss for M t ′ network; (b) training loss for M t network.

Machines 2024, 12, x FOR PEER REVIEW 13 of 17

Figure 8. Comparison of generated paths using proposed method and uniform sampling. (a) Uniform sampling; (b) proposed method.

Figure 10. Validation of the proposed method on a complicated map. (a) Complicated map and generated path; (b) loss curve.

Funding: This research was funded by the National Natural Science Foundation of China, grant numbers 52102445 and 52302508.

Data Availability Statement: No new data were created.


Table 1. Comparison between different network structures.