To achieve efficient, high-performance, and stable navigation in dynamic environments, this paper introduces DPDQN-TER, a closed-loop autonomous navigation method based on a DRL framework. First, the navigation task is formulated as a Markov Decision Process (MDP), which establishes the foundation for the state, action, and reward definitions. Building on this formulation, the proposed method integrates two key components: a multi-branch parametric network architecture and a Transformer-based experience enhancement mechanism. The former decouples continuous parameter generation across the different discrete actions to mitigate interference and improve policy stability, while the latter leverages self-attention to capture temporal dependencies in trajectories and to enhance the efficiency of experience utilization. Together, these two components form the core of the DPDQN-TER framework.
The autonomous driving task is formulated as a Markov Decision Process (MDP) [38], defined by the tuple {S, A, R, γ}, where S denotes the state space, A the action space, R the reward function, and γ the discount factor, described as follows:
State space (S): The state space includes various types of information, such as the position of the mobile robot, the position of the target point, LiDAR measurements, image features, and local path points. Through feature fusion, these elements form a state vector with a total dimension of 108. Notably, the “LiDAR” channel (sampled independently and identically distributed, i.i.d., from a uniform distribution U[0.5, 5.0]) and the “camera” channel (i.i.d. from a normal distribution N(0, 1)) are synthetically generated and environment-agnostic, serving as abstract placeholders without involving ray casting or actual image rendering. A sketch of how such a state vector can be assembled is given after this list.
Action space (A): The method adopts a parameterized action design that combines discrete actions with continuous parameters. The discrete actions are defined as forward, left turn, and right turn, while the continuous parameters are the linear velocity v (for the forward action) and the angular velocity ω (for the turning actions).
Reward (R): The reward function encourages the robot to move closer to the target, avoid obstacles, and prevent collisions. It also penalizes deviations from the designated path. The objective is to jointly optimize stability and safety.
Discount factor (γ): Strategy optimization considers both immediate and long-term rewards, ensuring that navigation strategies balance short-term responsiveness with global optimality.
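As a concrete illustration, the following Python/NumPy sketch assembles such a state vector: placeholder “LiDAR” and “camera” features are drawn from the stated distributions and concatenated with pose, goal, and local-path information. The per-block dimensions are assumptions made for illustration; only the total dimension (108) and the sampling distributions are specified in the text.

```python
import numpy as np

def build_state(robot_pose, goal_info, path_points, rng=np.random.default_rng()):
    """Assemble a 108-D state vector with synthetic sensor placeholders.

    The per-block dimensions below (60 + 32 + 3 + 3 + 10) are illustrative;
    only the total dimension and the sampling distributions come from the text.
    """
    lidar = rng.uniform(0.5, 5.0, size=60)                     # "LiDAR" placeholder ~ U[0.5, 5.0]
    camera = rng.normal(0.0, 1.0, size=32)                     # "camera" placeholder ~ N(0, 1)
    pose = np.asarray(robot_pose, dtype=np.float32)            # (x, y, theta)
    goal = np.asarray(goal_info, dtype=np.float32)             # (x_g, y_g, distance)
    path = np.asarray(path_points, dtype=np.float32).ravel()   # e.g., 5 local waypoints (x, y)
    state = np.concatenate([lidar, camera, pose, goal, path]).astype(np.float32)
    assert state.shape[0] == 108, "adjust the illustrative split to match the real layout"
    return state

# Example usage with hypothetical inputs
s = build_state(robot_pose=(0.0, 0.0, 0.0),
                goal_info=(4.0, 3.0, 5.0),
                path_points=[(0.5, 0.4), (1.0, 0.8), (1.5, 1.2), (2.0, 1.5), (2.5, 1.9)])
```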
The overall architecture of the navigation system is illustrated in Figure 1. It consists of four stages: state perception, action decision, experience modeling, and strategy optimization. During interaction with the environment, the mobile robot selects an action (A) and receives a reward (R) based on its current state (S). This process generates experience samples that are stored in the replay buffer. Subsequently, the robot processes trajectory samples using a Transformer-based experience enhancement mechanism. The enhanced experience is then passed to a shared state feature extractor (MLP), which produces a unified state representation. This representation is fed into both a Q-network and a multi-branch parameter policy network, enabling joint action evaluation and continuous parameter generation. The selected actions interact with the environment to generate new experiences, which are added to the replay buffer. Through this closed-loop cycle, the mobile robot iteratively improves its policy until it converges to an optimal navigation strategy.
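To make the closed-loop cycle concrete, the following Python-style skeleton sketches one interaction-and-update step according to the description above. The component interfaces (env, replay_buffer, transformer_sampler, policy, q_network) are hypothetical names introduced for illustration only.

```python
# Minimal sketch of the closed-loop cycle (all interfaces are assumed).
def training_step(env, state, policy, q_network, transformer_sampler, replay_buffer):
    # 1. Action decision: branch heads propose continuous parameters,
    #    and the Q-network scores the resulting combined actions.
    params = policy.continuous_params(state)        # one parameter per discrete action
    action = q_network.select(state, params)        # epsilon-greedy over (k, x_k) pairs

    # 2. Environment interaction: execute the action, observe reward and next state.
    next_state, reward, done = env.step(action)
    replay_buffer.add((state, action, reward, next_state, done))

    # 3. Experience modeling: the Transformer scores trajectories and reweights sampling.
    batch = transformer_sampler.sample(replay_buffer)

    # 4. Strategy optimization: jointly update the Q-network and the multi-branch policy.
    q_network.update(batch)
    policy.update(batch, q_network)

    return env.reset() if done else next_state
```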
3.1. Improved Multi-Branch Parametric Network Architecture
The improved multi-branch parametric network architecture of DPDQN-TER is illustrated in Figure 2. The core of the model consists of a shared feature extractor (MLP), a multi-branch parameter policy network, and a Q-network. The MLP encodes the input state into a unified state feature vector Z, which is then shared by the subsequent networks.
The parallel multi-branch structure contains three independent sub-networks, each responsible for generating the continuous parameter of one discrete action: forward, left turn, or right turn. Each sub-network is composed of three layers: an input layer, a hidden layer, and an output layer. The input layer receives the state feature vector Z produced by the multi-layer perceptron (MLP). The hidden layer is a fully connected layer with 64 neurons and a ReLU activation function, which enhances the nonlinear expressive capability of the features. The output layer is designed according to the control objective of each branch: the forward branch outputs a continuous linear velocity v, constrained to a bounded positive range by a Sigmoid activation function, while the left-turn and right-turn branches output a continuous angular velocity ω, restricted to a symmetric bounded range by a Tanh activation function. In the kinematic model, the instantaneous motion of the agent is determined by the linear velocity v and the angular velocity ω. The sign of ω indicates the turning direction (left or right), while its magnitude reflects the turning rate; when ω = 0, the agent moves in a straight line. Based on this property, the parameterized action space is decoupled into three discrete actions with distinct semantic meanings (forward, left turn, and right turn), yielding a minimal and non-redundant partition of the control manifold. This design offers several advantages. First, it aligns each network branch with a specific control behavior, thereby enhancing the interpretability of the learned policy. Second, by retaining v and ω as continuous variables instead of discretizing the turning magnitude, the approach preserves continuity within each branch and allows fine-grained control. Third, it effectively reduces the gradient interference that may arise when opposing turning behaviors are learned within a single branch.
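A minimal PyTorch sketch of this multi-branch parameter policy network is given below; the shared-MLP width, the feature dimension of Z, and the velocity bounds v_max and w_max are assumptions for illustration (the paper specifies only the 64-neuron hidden layers and the Sigmoid/Tanh output activations).

```python
import torch
import torch.nn as nn

class MultiBranchParamNet(nn.Module):
    """Three independent branches map the shared state feature Z to one
    continuous parameter per discrete action (forward, left turn, right turn)."""

    def __init__(self, state_dim=108, feat_dim=128, v_max=1.0, w_max=1.0):
        super().__init__()
        # Shared state feature extractor (MLP); the width is an illustrative assumption.
        self.shared = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU())
        # One 64-unit hidden layer per branch, as described in the text.
        def branch():
            return nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.forward_branch = branch()   # linear velocity v
        self.left_branch = branch()      # angular velocity omega (left turn)
        self.right_branch = branch()     # angular velocity omega (right turn)
        self.v_max, self.w_max = v_max, w_max

    def forward(self, state):
        z = self.shared(state)
        v = torch.sigmoid(self.forward_branch(z)) * self.v_max   # v in (0, v_max)
        w_left = torch.tanh(self.left_branch(z)) * self.w_max    # omega in (-w_max, w_max)
        w_right = torch.tanh(self.right_branch(z)) * self.w_max
        return v, w_left, w_right
```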
This design matches the bidirectional nature of rotational control. By constraining the parameter ranges, it guarantees the physical plausibility of the output parameters and improves training stability and convergence speed. Finally, the continuous parameters are combined with the discrete actions and passed to the Q-network, which evaluates the corresponding action values.
The Q-network is responsible for evaluating the value of the combined actions (k, x_k), where k denotes a discrete action and x_k its continuous parameter. It adopts a fusion input structure comprising three components. First, the MLP provides the state feature vector Z. Second, the discrete actions are mapped into 8-dimensional embedding vectors instead of the traditional one-hot encoding; this embedding compresses the representation space and enhances the model's ability to capture action semantics. Third, the parameter input layer directly receives the continuous parameters corresponding to the actions, such as the linear velocity or angular velocity; these parameters are represented as real scalars, preserving their physical meanings. The three components are concatenated and fed into the backbone network, which provides a comprehensive state-action representation. The backbone is a three-layer feedforward neural network: the first layer is a fully connected layer with 128 units and a ReLU activation function, which extracts higher-order features; the second layer is a fully connected layer with 64 units, also using ReLU to increase nonlinearity; and the final layer is a one-dimensional fully connected output layer that predicts the state-action value Q(s_t, k_t, x_{k_t}; θ_Q) of the given combined action. The optimization objective is to minimize the mean-squared Bellman error:

L(θ_Q) = E_{(s_t, k_t, x_{k_t}, r_t, s_{t+1}) ~ D} [ (y_t − Q(s_t, k_t, x_{k_t}; θ_Q))^2 ],

where L(θ_Q) is the temporal-difference loss function, D is the replay buffer, and y_t is the target Q value, calculated as:

y_t = r_t + γ max_{k ∈ A} Q(s_{t+1}, k, x_k(s_{t+1}; θ_x); θ_Q^-),

where r_t is the instant reward at time step t, γ is the discount factor, and Q(·; θ_Q^-) is the target Q-network with delayed parameters θ_Q^-. This formulation allows consistent value comparison across the parameter combinations of different discrete actions.
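A PyTorch sketch of this fusion-input Q-network is given below; the dimension of the state feature Z is an assumption, while the 8-dimensional action embedding and the 128-64-1 backbone follow the description above.

```python
import torch
import torch.nn as nn

class FusionQNetwork(nn.Module):
    """Scores a combined action (k, x_k): fuses the state feature Z, an 8-D
    embedding of the discrete action, and the raw continuous parameter,
    then applies a 128-64-1 feedforward backbone."""

    def __init__(self, feat_dim=128, n_discrete=3, action_emb_dim=8, param_dim=1):
        super().__init__()
        self.action_emb = nn.Embedding(n_discrete, action_emb_dim)   # instead of one-hot
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim + action_emb_dim + param_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),                                        # scalar Q value
        )

    def forward(self, z, discrete_action, continuous_param):
        # z: (B, feat_dim); discrete_action: (B,) long; continuous_param: (B, 1)
        a = self.action_emb(discrete_action)
        q = self.backbone(torch.cat([z, a, continuous_param], dim=-1))
        return q.squeeze(-1)
```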
The sequence of actions executed by the mobile robot in a given environment is determined by the following process. At each interaction step, the robot receives the current state s_t and generates the continuous parameter of each discrete action independently through the multi-branch parameter policy network:

x_k = π_k(s_t; θ_x),  k ∈ {forward, left turn, right turn},

where π_k denotes the branch of the policy network with parameters θ_x, and k indexes the k-th discrete action.
Subsequently, the Q-network evaluates the state-action value of all possible combined actions (k, x_k) and selects the combination with the highest Q value using an ϵ-greedy policy:

(k_t, x_{k_t}) = argmax_{k ∈ A} Q(s_t, k, x_k; θ_Q),

where argmax is the maximization operator. After executing the selected action, the environment transitions to the next state s_{t+1}, and the immediate reward r_t is received. This interaction experience (s_t, k_t, x_{k_t}, r_t, s_{t+1}) is stored in the replay buffer for subsequent training.
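A compact sketch of this ϵ-greedy selection over combined actions, reusing the illustrative networks sketched above, might look as follows; the exploration scheme for the random branch is an assumption.

```python
import random
import torch

def select_action(state, policy_net, q_net, epsilon=0.1):
    """Epsilon-greedy selection over the three (discrete action, parameter) pairs.

    state: tensor of shape (1, state_dim); policy_net and q_net are the
    illustrative MultiBranchParamNet and FusionQNetwork sketched earlier.
    """
    with torch.no_grad():
        z = policy_net.shared(state)                 # shared state feature Z
        v, w_left, w_right = policy_net(state)       # one continuous parameter per branch
        params = [v, w_left, w_right]
        if random.random() < epsilon:
            k = random.randrange(3)                  # explore: random discrete action
        else:
            q_values = torch.stack(
                [q_net(z, torch.tensor([k]), params[k]) for k in range(3)])
            k = int(q_values.argmax())               # exploit: best combined action
    return k, params[k].item()
```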
3.2. Experience Enhancement Mechanism Based on Transformer
To achieve continuous policy optimization, the Q-network and the multi-branch parameter policy network are iteratively updated during training. Since their effectiveness relies heavily on the quality of the replayed experience, this paper proposes a Transformer-based experience enhancement mechanism that reconstructs the temporal correlations in experience trajectories through sequential modeling, capturing temporal causality within trajectories while maintaining sample independence. As shown in Figure 3, the proposed structure consists of two main components: an input embedding module and a multi-head attention module.
In the embedding phase, the raw experiences collected during interaction are first organized into ordered trajectory data. Each trajectory is formalized as a multi-modal token sequence X, which is fed in parallel into a Transformer encoder. The state vector is projected into a fixed-dimensional feature vector using a linear layer. Discrete actions are mapped into dense vectors through an embedding lookup table. Continuous parameters are linearly projected and concatenated with the discrete action embedding to form a complete action vector representation. Rewards, as scalars, are also projected linearly to achieve a unified representation. In addition, position and type embeddings are incorporated to help the Transformer distinguish the time step and semantic category (state, action, or reward) of each token. Through these steps, the entire trajectory is represented as a unified input sequence to the Transformer.
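One plausible way to construct this multi-modal token sequence is sketched below: state, action, and reward tokens are projected into a common embedding space, and position and type embeddings are added. The embedding dimension and maximum trajectory length are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

class TrajectoryEmbedder(nn.Module):
    """Projects states, (discrete + continuous) actions, and rewards into a common
    token space and adds position and type embeddings, as described in the text."""

    def __init__(self, state_dim=108, n_discrete=3, emb_dim=64, max_len=64):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, emb_dim)
        self.action_emb = nn.Embedding(n_discrete, emb_dim // 2)   # discrete action lookup
        self.param_proj = nn.Linear(1, emb_dim // 2)               # continuous parameter
        self.reward_proj = nn.Linear(1, emb_dim)
        self.pos_emb = nn.Embedding(max_len, emb_dim)              # time-step embedding
        self.type_emb = nn.Embedding(3, emb_dim)                   # 0: state, 1: action, 2: reward

    def forward(self, states, actions, params, rewards):
        # states: (T, state_dim); actions: (T,) long; params, rewards: (T,)
        T = states.shape[0]
        s_tok = self.state_proj(states)
        a_tok = torch.cat([self.action_emb(actions),
                           self.param_proj(params.unsqueeze(-1))], dim=-1)
        r_tok = self.reward_proj(rewards.unsqueeze(-1))
        tokens = torch.stack([s_tok, a_tok, r_tok], dim=1).reshape(3 * T, -1)
        steps = torch.arange(T).repeat_interleave(3)               # same step for (s, a, r)
        types = torch.tensor([0, 1, 2]).repeat(T)                  # token semantic category
        return tokens + self.pos_emb(steps) + self.type_emb(types)  # (3T, emb_dim)
```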
In the encoding phase, the token sequence X is processed by a Transformer encoder composed of multiple layers, each containing a multi-head self-attention (MHA) mechanism and a feedforward network (FFN). The MHA performs global modeling of the trajectory sequence.
For each attention head, the query (Q), key (K), and value (V) vectors are constructed as:

Q_i = X W_i^Q,  K_i = X W_i^K,  V_i = X W_i^V.

The attention weights are computed as:

Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i,

where Attention(·) is the self-attention function, softmax(·) is the activation function, d_k = D/h is the single-head dimension, D is the embedding dimension, and h is the number of attention heads. The concatenated outputs of all heads are linearly transformed to obtain the multi-head attention output:

MHA(X) = Concat(head_1, …, head_h) W^O,  head_i = Attention(Q_i, K_i, V_i),

where MHA(X) is the multi-head attention output, head_i is the i-th attention head output, and W^O is the output weight matrix. This mechanism automatically extracts structural patterns from historical sequences, identifies critical states such as obstacle avoidance points, turning points, abrupt strategy changes, and path replanning events, and models their long-term dependencies with previous and subsequent experiences. The Transformer processes sequences in parallel and models arbitrary dependency structures, enabling comprehensive representation of high-value information.
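In implementation terms, such an encoder can be realized with a standard Transformer encoder stack; the sketch below uses PyTorch's built-in modules, with the depth, width, and head count chosen arbitrarily for illustration rather than taken from the paper.

```python
import torch.nn as nn

# Illustrative encoder configuration; depth, width, and head count are assumptions.
emb_dim, n_heads, n_layers = 64, 4, 2
encoder_layer = nn.TransformerEncoderLayer(
    d_model=emb_dim, nhead=n_heads, dim_feedforward=4 * emb_dim,
    batch_first=True)                       # MHA + FFN per layer, as in the text
trajectory_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# tokens: (batch, 3T, emb_dim) sequence produced by the embedding module above
# encoded = trajectory_encoder(tokens)      # (batch, 3T, emb_dim)
```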
In the Transformer output phase, for sequence-level scoring, the Transformer output sequence H = (h_1, …, h_L) is aggregated into a trajectory-level embedding e via average pooling:

e = (1/L) Σ_{t=1}^{L} h_t.

This embedding is fed into a feedforward scoring network that estimates the importance of each experience for the current policy. The scoring network is a two-layer fully connected network with ReLU activation, producing a scalar score that represents the learning value of each trajectory:

s_i = W_2 ReLU(W_1 e_i + b_1) + b_2,

where b_1 and b_2 are the offset vectors of W_1 and W_2, respectively. For sampling, these scores are softmax-normalized to form an importance distribution:

p_i = exp(s_i) / Σ_{j=1}^{B} exp(s_j),

where B is the number of trajectories in the sampled batch. During sampling, experiences are drawn from the replay buffer according to this distribution, with high-scoring experiences assigned greater probability. This allows the system to focus on high-value trajectories under limited training resources. Unlike conventional uniform replay or PER, which rely on local indicators such as the TD-error, this method scores entire trajectories using the global semantic information captured by the Transformer. As a result, it achieves stronger causal modeling and adaptability, making it particularly suitable for navigation tasks with sparse rewards or delayed feedback.
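A minimal sketch of the trajectory scorer and the softmax-weighted sampler described above is shown below; the layer widths and the use of torch.multinomial for drawing indices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryScorer(nn.Module):
    """Mean-pools the encoder output and maps it to a scalar learning-value score."""

    def __init__(self, emb_dim=64, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, encoded):                 # encoded: (B, L, emb_dim)
        pooled = encoded.mean(dim=1)            # trajectory-level embedding via average pooling
        return self.net(pooled).squeeze(-1)     # (B,) scalar scores

def sample_trajectories(scores, n_samples):
    """Softmax-normalize scores into an importance distribution and draw indices."""
    probs = torch.softmax(scores, dim=0)        # p_i = exp(s_i) / sum_j exp(s_j)
    return torch.multinomial(probs, n_samples, replacement=True)
```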
In summary, this section outlines the training mechanism and optimization process, which integrates trajectory modeling with a Transformer structure. The combination of experience scoring and weighted sampling based on semantic trajectory representations enables the mobile robot to make better use of obstacle-avoidance experiences and to improve learning efficiency. At the network optimization level, the joint training of the Q-network and the multi-branch policy network further accelerates convergence and enhances policy stability. Overall, this mechanism provides a training framework with improved temporal perception and generalization for dynamic path planning. Building on this design, we briefly clarify its deployment-time complexity to facilitate a fair interpretation of the subsequent experiments. At deployment, the controller executes a shared backbone, the selected branch head, and a Q head; the Transformer module is used only during training for experience replay and is not part of the control loop. Consequently, the per-step inference complexity remains constant with respect to sequence length, and no online planning or search is involved. The additional computation at deployment is limited to a lightweight branch-head forward pass that is of the same order of magnitude as that of PDQN. This design confines the sequence-modeling overhead to the training phase while preserving efficient and stable execution during online control.