High-Accuracy, High-Efficiency, and Comfortable Car-Following Strategy Based on TD3 for Wide-to-Narrow Road Sections

Abstract: To address traffic congestion on urban expressways at transitions from wide to narrow sections, this study proposes a car-following strategy based on deep reinforcement learning. First, a car-following strategy was developed based on the twin-delayed deep deterministic policy gradient (TD3) algorithm, and a multi-objective constrained reward function was designed that jointly considers safety, traffic efficiency, and ride comfort. Second, 214 car-following periods and 13 platoon-following periods were selected from a naturalistic driving database for strategy training and testing. Finally, the effectiveness of the proposed strategy was verified through car-following and platoon-following simulation experiments. The results showed that, compared to human-driven vehicles (HDV), the TD3 and deep deterministic policy gradient (DDPG)-based strategies enhanced traffic efficiency by over 29% and ride comfort by more than 60%. Furthermore, compared to DDPG, TD3 reduced the relative error between the following distance and the desired safety distance by 1.28% and 1.37% in the car-following and platoon-following simulation experiments, respectively. This study provides a new approach to alleviating traffic congestion on wide-to-narrow road sections of urban expressways.


Introduction
In sections of urban expressways where the road narrows, the reduction in the number of lanes decreases the road capacity. In addition, drivers adjust their speed imperfectly and may respond too weakly or too strongly relative to the expected value, causing frequent acceleration and deceleration [1]. Such road segments therefore often suffer from traffic congestion and reduced ride comfort. At the same time, the road environment becomes more complex and overtaking opportunities are limited, so vehicles are forced into car-following (CF) behavior. In this setting, CF technology that controls shorter following distances with high accuracy and generalizes well can reduce the drivers' workload, improve road traffic efficiency, and thus reduce traffic congestion [2].
CF describes the longitudinal interaction between vehicles traveling on single-lane roads with restricted overtaking [3]. CF models are mainly divided into theory-driven models and data-driven models. Theory-driven CF models rely on a few fixed parameters to model and deduce traffic phenomena theoretically, which makes it difficult to account for all the influencing factors and results in poor prediction accuracy in complex traffic flows. Standard theory-driven CF models include the IDM model [4], the Newell model [5], and the cellular automaton model [6]. Data-driven CF models mainly include deep learning CF models and deep reinforcement learning (DRL) CF models. Deep learning models rely on large-scale data and purely imitate the data samples, resulting in poor generalization ability. Common deep learning CF models include the RNN CF model, the LSTM CF model, and the GA-BP CF model [7][8][9]. In contrast, DRL adopts a self-learning mode and does not require prior samples for pure imitation; its agent learns through continuous trial-and-error interaction with the driving environment until the optimal strategy is obtained, demonstrating strong generalization ability [10].
Existing research shows that DRL-based strategies perform well across various types of CF problems. Gao et al. studied autonomous vehicle CF decision-making based on reinforcement learning and showed that reinforcement learning systems adapt better to CF and offer a certain degree of interpretability and stability [11]. CF strategies based on reinforcement learning spend most of their time on offline training; once trained, they can be deployed quickly for real-time use on vehicles [12]. Ye et al. proposed a decision-making training framework for autonomous driving using DRL. After training, the efficiency of the autonomous vehicles increased by 7.9% compared to vehicles controlled by the IDM model, verifying the effectiveness of the proposed model in learning driving decisions [13]. Sun et al. established an adaptive cruise control strategy model for heavy vehicles based on the deep deterministic policy gradient (DDPG) algorithm, achieving adaptive cruise control objectives on roads with different curvatures, and verified the effectiveness and robustness of the model through simulation experiments [14]. Zhu et al. constructed a speed control model for CF based on the DDPG. They verified that the model-controlled vehicle outperformed human drivers and model predictive control (MPC) adaptive cruise control models in terms of safety, efficiency, and comfort, and that it ran more than 200 times faster than the MPC algorithm during testing [15]. Shi et al. proposed DRL-based cooperative control of connected autonomous vehicles under mixed traffic flow with different penetration rates and different behaviors of human-driven leading vehicles. The proposed model can effectively complete car tracking and energy-saving tasks [16]. Yan et al. developed a hybrid vehicle CF strategy based on DDPG and cooperative adaptive cruise control, and the proposed strategy improved vehicle tracking performance [17]. Qin et al. developed DDPG and MADDPG CF models considering joint longitudinal and lateral control, using the OpenACC dataset to train and test them under straight and curved conditions of highway free flow, and the results showed that the developed models achieved better CF control than human-driven vehicles (HDV) [18]. Chen et al. proposed an intelligent speed control method for autonomous vehicles in cooperative vehicle-infrastructure systems based on the DDPG, and a vertical comfort evaluation showed that the method was effective on rough road surfaces [19].
In summary, CF control strategies based on DRL have made significant progress, but they are mainly limited to road sections with fixed widths. Segments transitioning from wide to narrow roads involve more intricate traffic scenarios and therefore call for more capable car-following control strategies. CF strategies for road sections with variable widths, such as those found in urban expressways, have received insufficient attention, and few studies have been able to balance travel efficiency and ride comfort. To alleviate this problem, this paper develops a CF control strategy based on deep reinforcement learning. This strategy uses a self-learning method that continuously interacts with the CF environment to obtain the optimal strategy. CF periods from congested traffic on variable-width sections of urban expressways are used for strategy training and testing. The main contributions of this study are as follows: (1) Developing a high-accuracy, high-efficiency, and comfortable CF strategy based on the twin-delayed deep deterministic policy gradient (TD3) suitable for road sections with variable widths of urban expressways; (2) Designing a new multi-objective constrained reward function, which considers safety, travel efficiency, and ride comfort. This function employs the error between the following distance and the desired safety distance, the speed error between the following vehicle and the leading vehicle, and the jerk value as the variables in exponential functions. This formulation leads to rapid and stable convergence during training. Notably, when a collision occurs during training, a substantial penalty is imposed, prompting the agent to learn collision avoidance strategies swiftly and autonomously; (3) Conducting simulation experiments on car-following and platoon-following to verify the effectiveness of the proposed strategy.
The rest of this paper is organized as follows: Section 2 introduces the research methods and data extraction. Section 3 presents the results and discussion. The final section concludes the paper.

CF Strategy Based on TD3
The policy function $\pi$ of reinforcement learning is defined by Equation (1), which represents the probability of taking action $a_t$ in a given environmental state $s_t$ at time $t$:

$$\pi(a_t \mid s_t) = P(a_t \mid s_t) \tag{1}$$

Using the policy function $\pi$, the agent selects an action $a_t$ based on the current environmental state $s_t$ at each time step. The agent then transitions to the new state $s_{t+1}$ according to the state transition probabilities $P(s_{t+1} \mid s_t, a_t)$ and receives an immediate reward $r_t$ from the environment. This process is known as reinforcement learning, as shown in Figure 1.

The twin-delayed deep deterministic policy gradient (TD3) is a DRL algorithm that is well suited to problems with continuous state and action spaces. The TD3 algorithm was proposed by Fujimoto et al. as an improvement over the DDPG algorithm to reduce the bias and variance introduced by function approximation in the actor-critic framework, making the model more stable [20]. The TD3 uses a total of six neural networks. An actor network is used to fit the policy function $\pi$, while critic networks 1 and 2 are used to estimate the action-value functions $Q_{i=1,2}$, with their parameters being independent. The target actor network is used to fit the target policy function $\pi'$, and the target critic networks 1 and 2 are used to fit the target action-value functions $Q'_{i=1,2}$, reducing the risk of overestimation.
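In the TD3, critic networks 1 and 2 are updated by minimizing the loss function $Loss(\theta_i)$, and the target value $\hat{Q}$ is calculated using the temporal difference (TD) principle:

$$Loss(\theta_i) = \frac{1}{N}\sum\left(\hat{Q} - Q_{\theta_i}(s_t, a_t)\right)^2, \quad i = 1, 2$$

$$\hat{Q} = r_t + \gamma \min_{i=1,2} Q_{\theta_i'}\left(s_{t+1}, \pi_{\varphi'}(s_{t+1}) + \varepsilon\right), \quad \varepsilon \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma), -c, c\right)$$

where $\theta_1$ and $\theta_2$ represent the parameters of critic network 1 and critic network 2, respectively, $N$ is the number of state transitions randomly selected from the experience replay buffer as a mini-batch for training, $r_t$ represents the reward obtained by taking a particular action in the current state, $\gamma$ is the discount factor for future rewards, and $\varepsilon$ represents noise drawn from a Gaussian distribution with a mean of 0 and a variance of $\sigma$ and confined to the range $(-c, c)$.

The actor network is updated using the policy gradient method:

$$\nabla_{\varphi} J(\varphi) = \frac{1}{N}\sum \nabla_a Q_{\theta_1}(s_t, a)\Big|_{a = \pi_{\varphi}(s_t)} \nabla_{\varphi} \pi_{\varphi}(s_t)$$

where $\varphi$ represents the parameters of the actor network. The target actor network and target critic networks 1 and 2 update their parameters as follows:

$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i', \quad i = 1, 2, \qquad \varphi' \leftarrow \tau \varphi + (1 - \tau)\varphi'$$

where $\theta_1'$ and $\theta_2'$ represent the parameters of target critic networks 1 and 2, respectively, $\tau$ is the soft update rate, and $\varphi'$ is the parameter of the target actor network.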
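As a rough illustration of the updates described above, the following Python sketch computes the clipped double-Q target, the smoothed target action, and the soft target update for scalar values. The function names and the default values of `gamma`, `sigma`, `c`, and `tau` are illustrative assumptions rather than the hyperparameters in Table 1, and `sigma` is treated here as a standard deviation.

```python
import random


def td3_target(r_t, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: r_t + gamma * min(Q1', Q2')."""
    return r_t + gamma * min(q1_next, q2_next)


def smoothed_target_action(a_next, sigma=0.2, c=0.5, a_max=2.0, rng=random):
    """Target policy smoothing: add Gaussian noise clipped to [-c, c],
    then clip the result to the action bounds [-a_max, a_max]."""
    eps = max(-c, min(c, rng.gauss(0.0, sigma)))  # sigma used as std dev (assumption)
    return max(-a_max, min(a_max, a_next + eps))


def soft_update(target_params, online_params, tau=0.005):
    """Soft update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

Taking the minimum of the two target critics is what counteracts overestimation, while the small `tau` keeps the target networks changing slowly.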
After multiple experiments, it was found that using a hidden layer with 64 neurons could achieve the best balance between computational cost and accuracy. In the hidden layers of both the actor and critic networks, the ReLU activation function was employed. Additionally, a Tanh activation function was applied to the actor network's output layer to constrain the output actions' boundary values. Furthermore, all the other layers of the actor and critic networks utilized fully connected layers. For more detailed information on the network structure and parameter updates of the TD3, refer to Figure 2.
The process of CF can be abstracted as a reinforcement learning problem, and the partially observable Markov decision process (POMDP) can be described using the parameter tuple $(s_t, a_t, r_t, s_{t+1})$. At a particular time step $t$, the observed environment state $s_t$ consists of the following variables: the speed of the following vehicle $v_f(t)$, the distance $\Delta d_{l-f}(t)$ to the leading vehicle, and the relative speed $\Delta v_{l-f}(t)$. The continuous action of the agent is the acceleration of the following vehicle, $a_f(t) \in [-2\ \mathrm{m/s^2}, 2\ \mathrm{m/s^2}]$ [21]. The update of the new state observation follows the vehicle kinematics in Equation (6), where $v_l(t+1)$ is the speed of the leading vehicle at the next time step and $T_s = 0.08$ s is the simulation time step.
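The state update can be sketched in Python as follows. Equation (6) is not restated in this text, so the exact form below is an assumption: the follower speed is integrated with a forward Euler step, the gap with a trapezoidal step, and the speed is prevented from going negative. The function name `step` is hypothetical; the time step and acceleration bounds are taken from the text.

```python
T_S = 0.08                 # simulation time step in seconds (from the text)
A_MIN, A_MAX = -2.0, 2.0   # acceleration bounds in m/s^2 (from the text)


def step(v_f, gap, dv, a_f, v_l_next, t_s=T_S):
    """Advance the observed state (v_f, gap, dv) by one time step.

    v_f: follower speed, gap: distance to the leader,
    dv: leader speed minus follower speed, a_f: commanded acceleration,
    v_l_next: leader speed at the next step (read from trajectory data).
    """
    a_f = max(A_MIN, min(A_MAX, a_f))            # respect the action bounds
    v_f_next = max(0.0, v_f + a_f * t_s)         # assumption: no reversing
    dv_next = v_l_next - v_f_next
    gap_next = gap + 0.5 * (dv + dv_next) * t_s  # assumption: trapezoidal rule
    return v_f_next, gap_next, dv_next
```

For example, `step(10.0, 20.0, 0.0, 1.0, 10.0)` moves a follower that accelerates at 1 m/s² behind a leader holding 10 m/s, so the gap shrinks slightly.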
The TD3 algorithm is used to construct a CF strategy in a simulation environment based on the CF trajectory data. The optimal strategy is obtained through continuous interaction between the agent and the driving environment and the supervision of the reward function signal. The training framework is shown in detail in Figure 3.

Evaluation Metrics for Car-Following Behavior
This study introduces the generalized force model to define the desired safe distance (DSD) $d_{DSD}(t)$ [22]. The smaller the error between the following distance and the DSD, the more stable the distance control between the following and leading vehicles, and the higher the safety:

$$d_{DSD}(t) = v_f(t)\, t_r + d_0 \tag{7}$$

where $t_r = 1.2$ s is the constant time headway, and $d_0 = 2$ m is the gap between the two cars when the following speed is 0.

Time headway (THW) describes the time interval between the leading and following vehicles. Provided that safety conditions are met, a smaller time headway means higher traffic efficiency [23], as defined in Equation (8). The jerk value represents the rate of change of acceleration during the CF process. A smaller absolute value of jerk indicates smoother acceleration and deceleration of the following vehicle and therefore greater ride comfort, as defined in Equation (9).
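The three evaluation metrics can be sketched in Python as below. The sketch assumes the linear DSD form suggested by the definitions of $t_r$ and $d_0$, THW computed as the gap divided by the follower speed, and jerk approximated by a finite difference over one simulation step; the function names are hypothetical.

```python
T_R = 1.2   # constant time headway in seconds (from the text)
D_0 = 2.0   # standstill gap in metres (from the text)
T_S = 0.08  # simulation time step in seconds (from the text)


def desired_safe_distance(v_f):
    """DSD in metres for a follower speed v_f in m/s."""
    return T_R * v_f + D_0


def time_headway(gap, v_f):
    """THW in seconds; infinite when the follower is stationary."""
    return gap / v_f if v_f > 0 else float("inf")


def jerk(a_now, a_prev, t_s=T_S):
    """Finite-difference jerk in m/s^3 (assumed discretization)."""
    return (a_now - a_prev) / t_s


def relative_distance_error(gap, v_f):
    """Relative error between the actual gap and the DSD."""
    dsd = desired_safe_distance(v_f)
    return abs(gap - dsd) / dsd
```

At 10 m/s the DSD is 14 m, so a follower holding exactly that gap has a time headway of 1.4 s and zero relative error.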

Reward Function
The design goal of the reward function in this study is to control the following vehicle to travel accurately according to the DSD in the section of the urban expressway where the road width changes from wide to narrow while significantly improving road traffic efficiency and ride comfort. As the supervision signal of DRL, the quality of the reward function design directly affects the agent's ability to learn the expected strategy, thus affecting the reinforcement learning algorithm's convergence speed and final performance. Under the premise of ensuring safety, a multi-objective constrained reward function is designed by considering safety, traffic efficiency, and ride comfort. The reward function expression is designed to minimize the difference between the target and observation values.
Considering the safety of the CF behavior and road traffic efficiency, the error between the distance $d_{l-f}(t)$ and the DSD $d_{DSD}(t)$ is used to design the reward function $r_1(t)$. The smaller the error between the distance and the DSD, the greater the reward, as defined in Equation (10). To make the agent learn to avoid collisions, a penalty function $r_{1\_collision}(t)$ is designed, as shown in Equation (11). The reward function $r_2(t)$ is designed using the speed difference between the leading and following vehicles to encourage the vehicle to maintain an appropriate speed difference, as shown in Equation (12). For ride comfort, the reward function $r_3(t)$ is designed using the ratio of the jerk value to the maximum jerk value, as shown in Equation (13). In these equations, $\omega_1 = 1$, $\omega_2 = 1$, and $\omega_3 = 1$ are the weight coefficients, and 22.22 m/s is the speed limit of the urban expressway. In summary, the total reward function combines the sub-reward functions, with $\lambda_1 = 0.8$, $\lambda_2 = 0.2$, and $\lambda_3 = 0.1$ as the weight coefficients of each sub-reward function.
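A possible shape for such a reward is sketched below in Python. Equations (10)-(13) are not restated in this text, so everything except the weights and the speed limit is an assumption made for illustration: the decaying exponential form of each sub-reward, the normalization of the distance error by the DSD, the weighted sum, the collision penalty value, and the maximum jerk (taken here as the largest acceleration change of 4 m/s² over one 0.08 s step).

```python
import math

LAMBDA = (0.8, 0.2, 0.1)   # sub-reward weights (from the text)
OMEGA = (1.0, 1.0, 1.0)    # exponent weights (from the text)
V_LIMIT = 22.22            # expressway speed limit in m/s (from the text)
JERK_MAX = 50.0            # assumption: (2 - (-2)) m/s^2 over 0.08 s
COLLISION_PENALTY = -10.0  # assumption: illustrative magnitude only


def reward(gap, d_dsd, dv, jerk):
    """Hypothetical multi-objective reward for one time step."""
    if gap <= 0.0:  # collision: return the penalty instead of the shaped reward
        return COLLISION_PENALTY
    r1 = math.exp(-OMEGA[0] * abs(gap - d_dsd) / d_dsd)  # distance tracking
    r2 = math.exp(-OMEGA[1] * abs(dv) / V_LIMIT)         # speed matching
    r3 = math.exp(-OMEGA[2] * abs(jerk) / JERK_MAX)      # ride comfort
    return LAMBDA[0] * r1 + LAMBDA[1] * r2 + LAMBDA[2] * r3
```

Under these assumptions the reward is bounded above by the sum of the weights, 1.1, which is reached only when the gap equals the DSD, the speeds match, and the jerk is zero.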
The pseudo-code of the CF strategy based on TD3 is shown in Algorithm 1, and its corresponding hyperparameters are detailed in Table 1.

Algorithm 1: Car-Following Strategy Based on TD3
Initialize critic networks $Q_1(s_t, a_t)$, $Q_2(s_t, a_t)$ and actor network $\pi(s_t)$ with random parameters $\theta_1$, $\theta_2$, $\varphi$
Initialize target networks $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, $\varphi' \leftarrow \varphi$
Initialize replay buffer
for episode = 1 to $M$ do
    Initialize random process for action exploration $\varepsilon_0$
    Receive initial state $s$
    for $t$ = 1 to $T_f$ do
        Choose action based on current policy and noise: $a_t \leftarrow \pi(s_t) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$
        Execute action $a_t$, obtain the reward $r_t$, and enter the next state $s_{t+1}$
        Store the state transition sequence $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer
        Randomly take a small batch of samples from the replay buffer:

The CF trajectory data used in this study were obtained by the ubiquitous traffic eyes (UTE) team of Southeast University through high-altitude aerial photography using drones on multiple urban expressways in Nanjing, China. The speed limit on these urban expressways is 80 km/h. The drones were flown at altitudes of over 200 m to cover congested and free-flow traffic conditions. Finally, the team used algorithms to extract data from the video footage [24].
The UTE team extracted six datasets, and this study selected datasets 1 and 3 as the data sources. Both datasets were collected on sections of urban expressways where lanes narrow, covering the entire evolution process from free-flow to congested traffic. Dataset 1's lane distribution is shown in Figure 4a, where the number of lanes decreased from 5 to 4, then from 4 to 3. Dataset 3's lane distribution is shown in Figure 4b, where the lanes decreased from 5 to 3 [25]. The database parameters can be found in Table 2, and the congested scenes in the video can be seen in Figure 4c.
This study extracted a total of 214 CF periods and 13 platoon-following periods from the database, which had the following characteristics: (1) The duration of each CF trajectory was greater than 20 s; (2) To ensure that the vehicles did not change lanes or make sudden turns, the lateral position difference between the leading vehicle and the following vehicle was less than 1 m; (3) All vehicles were in congested traffic flow.
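The three selection criteria can be expressed as a simple filter. The Python sketch below assumes each candidate period has already been summarized into a dictionary; the keys `duration_s`, `max_lateral_offset_m`, and `congested` are hypothetical names and not fields of the UTE data.

```python
MIN_DURATION_S = 20.0       # criterion (1): duration greater than 20 s
MAX_LATERAL_OFFSET_M = 1.0  # criterion (2): lateral difference below 1 m


def is_valid_cf_period(period):
    """Return True if a candidate period meets all three criteria."""
    return (period["duration_s"] > MIN_DURATION_S
            and period["max_lateral_offset_m"] < MAX_LATERAL_OFFSET_M
            and period["congested"])  # criterion (3): congested flow only


def select_cf_periods(periods):
    """Keep only the candidate periods that pass the filter."""
    return [p for p in periods if is_valid_cf_period(p)]
```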

Strategies Training
The purpose of training is to enable the agent to interact fully with the environment and obtain the optimal strategy. During training, one trajectory was randomly selected from 150 CF trajectories for each episode, with the remaining data (64 CF periods and 13 platoon-following periods) used as the test dataset. The training was repeated for 1800 episodes. The mean reward refers to the average reward value over all the time steps in a training episode, while the moving mean episode reward is the average reward value over a moving window of size 100. Under the supervision of the reward function signal, the agent continuously interacted with the environment through trial-and-error learning, maximizing the cumulative rewards and eventually reaching convergence.

Figure 5 illustrates the training results based on the TD3 and DDPG. The TD3-based strategy began to converge after about 110 episodes and reached convergence after about 235 episodes, with a training duration of 1 h and 5 min. The DDPG-based strategy began to converge after about 168 episodes and reached convergence after about 296 episodes, with a training duration of 1 h and 35 min. Therefore, the TD3-based strategy converged faster and reduced the training time by 31.58% compared to the DDPG, effectively reducing the training costs.
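The moving mean episode reward and the reported time saving can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the window is assumed to be shorter than 100 for the first episodes, before a full window of history exists.

```python
def moving_mean(values, window=100):
    """Mean of the most recent `window` values at each position."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Training durations from the text, in minutes: DDPG 95, TD3 65.
time_saving_percent = round(relative_reduction(95, 65) * 100, 2)
```

The 30 min difference over the 95 min DDPG baseline gives the 31.58% reduction stated above.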

Simulation Results of Car-Following Experiments
In total, 64 CF periods were tested to verify the effectiveness of the proposed strategy on CF behavior, and no collisions occurred during the entire testing process. Figure 6 shows that the mean rewards for the TD3 and DDPG strategies in the test results were 1.04 and 1.02, respectively, indicating that the TD3 achieved a higher mean reward than the DDPG. Table 3 presents the results of all the car-following tests. Compared to the HDV and DDPG, the TD3 reduced the relative error between the following distance and the DSD by 41.82% and 1.28%, respectively, suggesting that the TD3-based strategy offered the highest accuracy. Compared to the HDV, the TD3 and DDPG reduced the mean time headway by 29.30% and 29.17%, respectively, and the mean absolute jerk by 60.22% and 64.61%, respectively. These reductions in the time headway and absolute jerk greatly enhanced the road traffic efficiency and ride comfort. Although the mean absolute jerk based on the TD3 was slightly larger than that of the DDPG, the TD3 showed a distinct advantage in limiting the error between the following distance and the DSD, and it achieved higher traffic efficiency. Overall, the TD3-based strategy demonstrated the best performance in CF behavior.

Since the error between the initial distance and the DSD can cause differences in CF behavior, a CF trajectory was randomly selected from the test dataset for detailed analysis and discussion under three conditions: when the initial distance was equal to, less than, or greater than the DSD.
As indicated by Table 4 and Figures 7-9, the TD3-based strategy consistently demonstrated superior control accuracy regardless of whether the initial distance was equal to, less than, or greater than the DSD. When the initial distance was equal to the DSD, the following vehicle using the TD3 and DDPG could immediately drive according to the DSD. Compared with the HDV and DDPG, the relative error using the TD3 was reduced by 0.73% and 15.14%, respectively. When the initial distance was less than the DSD, the vehicle using the TD3 and DDPG could decelerate to reach the DSD and then drive according to the DSD. Compared with the HDV and DDPG, the relative error using the TD3 was reduced by 3.53% and 14.18%, respectively. When the initial distance was greater than the DSD, the vehicle using the TD3 and DDPG could accelerate to reach the DSD and then drive according to the DSD. Compared with the HDV and DDPG, the relative error using the TD3 was reduced by 3.35% and 99.49%, respectively. In all three scenarios, the TD3-based strategy proved to be highly accurate, efficient, and comfortable in CF.

Simulation Results of Platoon-Following Experiment
In congested traffic flow, most vehicles travel in a platoon with multiple vehicles following each other. To further verify the effectiveness of the proposed strategy, thirteen platoon-following simulation experiments were conducted, each containing five to nine vehicles. The platoons used the predecessor-following communication topology [26], as shown in Figure 10. The leading vehicle in each platoon was an HDV, and the initial values of the following vehicles in the simulation were derived from the trajectory data. No collisions occurred in any of the platoon-following simulation experiments.

Table 5 presents the results of the simulation tests for all the platoons. In these tests, the TD3-based strategy demonstrated superior control accuracy. Compared to the DDPG and HDV, the TD3 reduced the mean error between the following distance of the following vehicles and the DSD by 1.37% and 41.12%, respectively. Furthermore, compared to the HDV, the TD3 and DDPG reduced the mean time headway of the following vehicles by 31.59% and 31.10%, respectively, leading to a significant enhancement in traffic efficiency. Lastly, the TD3 and DDPG reduced the mean absolute jerk of the platoons by 81.26% and 83.08%, respectively, resulting in a substantial improvement in ride comfort.

Simulation Results of Platoon-Following Experiment
In congested traffic flow, most vehicles travel in a platoon with multiple vehicles following each other. To further verify the effectiveness of the proposed strategy in this paper, thirteen platoon-following simulation experiments were conducted, each containing from five to nine vehicles. The topology structure of the platoon-following was the predecessor-following communication topology [26], as shown in Figure 10. The leading vehicle in each platoon-following was an HDV, and the initial values of the following vehicles in the simulation were derived from the trajectory data. No collisions occurred during all the platoon-following simulation experiments.

Simulation Results of Platoon-following Experiment
In congested traffic flow, most vehicles travel in platoons, with multiple vehicles following one another. To further verify the effectiveness of the proposed strategy, thirteen platoon-following simulation experiments were conducted, each containing five to nine vehicles. The platoons used the predecessor-following communication topology [26], as shown in Figure 10. The leading vehicle in each platoon was an HDV, and the initial states of the following vehicles in the simulation were taken from the trajectory data. No collisions occurred in any of the platoon-following simulation experiments.
Table 5 presents the results of the simulation tests for all the platoons. The TD3-based strategy achieved the highest control accuracy in the platoon-following simulations: compared to the DDPG and the HDV, the mean error between the following distance of the following vehicles and the DSD was reduced by 1.37% and 41.12%, respectively, when using the TD3. Furthermore, the mean time headway of the following vehicles was reduced by 31.59% and 31.10% when using the TD3 and DDPG, respectively, compared to the HDV, a significant enhancement in traffic efficiency. Lastly, the mean absolute jerk of the platoons was reduced by 81.26% and 83.08% when using the TD3 and DDPG, respectively, a substantial improvement in ride comfort.
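The three indicators reported for the platoons (spacing error relative to the DSD, time headway, and absolute jerk) can be illustrated with a minimal sketch. The function names, the headway definition (gap divided by follower speed), the finite-difference jerk, and the relative form of `percent_reduction` are assumptions made for illustration; the exact formulas behind Table 5 may differ.

```python
from statistics import mean


def relative_spacing_error(gaps, dsd):
    """Mean relative error (%) between the following distance and the DSD."""
    return 100.0 * mean(abs(g - d) / d for g, d in zip(gaps, dsd))


def mean_time_headway(gaps, speeds, min_speed=0.1):
    """Mean time headway (s), assumed here to be gap / follower speed.

    Samples at or below min_speed are skipped to avoid dividing by
    (nearly) zero when the follower is at a standstill.
    """
    return mean(g / v for g, v in zip(gaps, speeds) if v > min_speed)


def mean_absolute_jerk(accels, dt):
    """Mean absolute jerk (m/s^3) from finite differences of acceleration."""
    return mean(abs(a1 - a0) / dt for a0, a1 in zip(accels, accels[1:]))


def percent_reduction(baseline, value):
    """Relative reduction (%) of `value` with respect to `baseline` (assumed form)."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    # Toy samples, not data from the study.
    gaps = [20.0, 21.0, 19.0]      # following distance (m)
    dsd = [20.0, 20.0, 20.0]       # desired safety distance (m)
    speeds = [10.0, 10.0, 10.0]    # follower speed (m/s)
    accels = [0.0, 0.5, 0.0]       # follower acceleration (m/s^2)
    print(relative_spacing_error(gaps, dsd))
    print(mean_time_headway(gaps, speeds))
    print(mean_absolute_jerk(accels, dt=0.1))
```

Applying the same functions to the HDV trajectory and to each learned strategy, and then comparing the results with `percent_reduction`, would give comparisons of the kind reported above, under the stated assumptions.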
The results of a randomly selected platoon-following experiment are presented in Table 6 and Figures 11-13. The platoon using the TD3 traveled along the desired trajectory. The mean error between the following distance of the following vehicles and the DSD was 0.87% when using the TD3, the highest control accuracy among the compared strategies, and both road traffic efficiency and ride comfort were significantly improved.
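The predecessor-following setup used in the platoon experiments, in which each follower reacts only to the vehicle directly ahead, can be sketched as a simple rollout. This is an illustrative sketch, not the trained strategy: `placeholder_policy` is a hypothetical linear feedback law standing in for the TD3 actor, `desired_gap` assumes a constant-time-headway form of the DSD with made-up parameters, and vehicles are treated as points, so vehicle length is ignored.

```python
def desired_gap(speed, standstill=2.0, headway=1.2):
    # Assumed DSD form (constant time headway); the paper's definition may differ.
    return standstill + headway * speed


def placeholder_policy(gap, speed, lead_speed):
    # Hypothetical stand-in for the trained actor: linear feedback on the
    # spacing error and relative speed, clipped to +/- 3 m/s^2.
    accel = 0.5 * (gap - desired_gap(speed)) + 0.8 * (lead_speed - speed)
    return max(-3.0, min(3.0, accel))


def simulate_platoon(leader_speeds, init_positions, init_speeds,
                     dt=0.1, policy=placeholder_policy):
    """Roll out a platoon; vehicle i observes only vehicle i-1.

    Index 0 is the leader, which replays a recorded speed profile.
    Returns one list of gaps (m) per follower.
    """
    pos, spd = list(init_positions), list(init_speeds)
    gaps = [[] for _ in pos[1:]]
    for v_lead in leader_speeds:
        # Snapshot so every follower acts on the previous step's states.
        prev_pos, prev_spd = pos[:], spd[:]
        spd[0] = v_lead
        pos[0] = prev_pos[0] + v_lead * dt
        for i in range(1, len(pos)):
            gap = prev_pos[i - 1] - prev_pos[i]
            accel = policy(gap, prev_spd[i], prev_spd[i - 1])
            spd[i] = max(0.0, prev_spd[i] + accel * dt)  # no reversing
            pos[i] = prev_pos[i] + spd[i] * dt
            gaps[i - 1].append(pos[i - 1] - pos[i])
    return gaps


if __name__ == "__main__":
    # Follower starts 6 m beyond the assumed desired gap of 14 m at 10 m/s.
    result = simulate_platoon([10.0] * 300, [20.0, 0.0], [10.0, 10.0])
    print(result[0][0], result[0][-1])
```

Swapping `placeholder_policy` for a learned actor with the same inputs would keep the rollout structure unchanged, since the topology only fixes which vehicle each follower observes.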

Conclusions
This study proposes a high-accuracy, high-efficiency, and comfortable CF strategy based on the TD3 for wide-to-narrow road sections in urban expressways. The results indicate that, compared to the HDV and DDPG, the following vehicle using the TD3 tracks the DSD more accurately while maintaining high traffic efficiency and ride comfort. On the test dataset of the CF and platoon-following simulations, traffic efficiency and ride comfort increased by over 29% and 60%, respectively, when using the TD3 and DDPG, compared to the HDV. The TD3-based strategy exhibited the highest control accuracy, with mean relative errors between the following distance and the DSD of 0.96% and 1.10% in the CF and platoon-following simulations, respectively. These errors were primarily due to the discrepancy between the initial distance and the DSD. When the initial distance equals the DSD, the vehicle using the TD3 drives according to the DSD immediately, maintaining a low jerk value and high ride comfort. If the initial distance is less or more than the DSD, the vehicle using the TD3 decelerates or accelerates to reach the DSD, with a brief fluctuation in jerk during this transition. Once the DSD is reached, the jerk remains small and stable near zero, indicating that the TD3-based strategy significantly enhances ride comfort under varying initial conditions.
Future research can expand on this study in several ways. Firstly, this study considers two datasets of urban expressways with wide-to-narrow road sections, and additional similar scenarios can be incorporated. Secondly, while this study focuses on longitudinal following strategies, future research could include lateral control and robustness studies of deep reinforcement learning CF strategies. Lastly, this study relies on CF trajectory data for simulation tests, and further hardware-in-the-loop or real vehicle platform tests could be conducted to validate the effectiveness of the proposed strategy.
