Next-Hop Relay Selection for Ad Hoc Network-Assisted Train-to-Train Communications in the CBTC System

In the communication-based train control (CBTC) system, traditional modes such as LTE or WLAN in train-to-train (T2T) communication face the problem of complex and costly deployment of base stations and ground core networks. Therefore, the multi-hop ad hoc network, which is comparatively flexible and cheap, is considered for CBTC. However, because of the high mobility of the train, it is likely to move out of the communication range of wayside nodes. Moreover, some wayside nodes are heavily congested, resulting in long packet queuing delays that cannot meet the transmission requirements. To solve these problems, in this paper, we investigate the next-hop relay selection problem in multi-hop ad hoc networks to minimize transmission time, enhance the network throughput, and ensure the channel quality. In addition, we propose a multiagent dueling deep Q-network (DQN) algorithm to optimize the delay and throughput of the entire link by selecting the next-hop relay node. The simulation results show that, compared with existing routing algorithms, the proposed algorithm achieves clear improvements in delay, throughput, and packet loss rate.


Introduction
Recently, with the rapid development of urbanization, urban rail transit has become one of the main modes of transportation. With the development of technology, the communication-based train control (CBTC) system plays an important role in urban rail transit to guarantee the safe operation of rail trains [1]. To ensure their safety and reliability, CBTC systems have strict requirements on transmission delay and channel quality [2]. Long communication delays and link interruptions may lead to emergency brakes or collisions [3]. Therefore, it is crucial to design a CBTC communication system with low latency and high channel quality.
In traditional CBTC systems, long-term evolution for metro (LTE-M) and wireless local area networks (WLANs) are more widely used in train-to-wayside communication [4]. The train information is first transmitted to the ground-zone controller (G-ZC), which is used to generate control commands for all trains in its management area [5]. After obtaining the commands, the wayside node sends the commands back to trains. However, due to the huge computational burden of the G-ZC and non-direct transmission link [6], the transmission delay of important control commands is excessively large. Therefore, the T2T direct transmission approach was proposed [4], while the G-ZC was also changed to an onboard-zone controller (On-ZC). Unlike the G-ZC, the On-ZC only needs to generate its own commands, which greatly reduces computation latency.
Although direct T2T transmission greatly reduces latency [7], if we continue to use WLAN or LTE, interruptions and delays caused by hard handoff at base station boundaries remain unavoidable. In addition, the deployment of the terrestrial core network is complex and costly. The main contributions of this paper are summarized as follows:
1. We formulate the next-hop relay selection problem in a CBTC system. The goal is to select relay nodes with low transmission delay and high throughput within the communication range of both the train and the wayside nodes. Meanwhile, in order to balance the single-hop transmission delay and the whole-link hop count, we propose the concept of "hop tradeoff" to minimize the entire link latency.
2. To handle the time-varying channel state and node congestion, we propose a DRL algorithm to optimize the long-term system reward. Using a multiagent approach [14], all nodes are trained centrally with dueling DQN [19], and each node then makes its next-hop decision individually, so as to avoid nodes with long queuing delays and poor channel quality.
3. Lastly, we conduct simulations with different numbers of nodes between two trains and different buffer sizes. The proposed algorithm is compared with several existing algorithms in terms of whole-link delay, packet loss rate, and throughput. The simulation results indicate that the proposed scheme copes well with congested networks. In particular, it is also significantly superior to other routing algorithms in terms of whole-link delay, throughput, and packet loss rate.
The remainder of this paper is organized as follows: in Section 2, related work on routing selection in ad hoc networks is introduced; in Section 3, we present a multi-hop relay selection model for ad hoc networks in CBTC systems; the joint optimization problem of channel throughput and whole-link delay is formulated in Section 4; then, we introduce the multiagent deep reinforcement learning method to solve the formulated problem in Section 5; simulation results and analyses are presented in Section 6; in Section 7, we conclude the paper and propose future work.

Traditional Communication Method in CBTC System
In the traditional CBTC system, WLAN is widely used for communication between trains and the wayside base station. Zhu et al. [20] proposed a WLAN-based redundant connection scheme for train-ground communication, in which the train maintains a backup link and an active link simultaneously to deal with interruptions at the coverage boundaries of the two access nodes. However, many WLAN standards based on IEEE 802.11 are not suitable for high-speed mobile environments [5]. Meanwhile, WLAN operates in an open frequency band, which can easily be interfered with by other devices [21]. LTE has strong anti-interference ability and handles switching between access nodes more stably; thus, LTE-based approaches have been proposed for the CBTC system. In [6], a sensing-based semi-persistent scheduling method for LTE-based T2T communication was proposed, which greatly improved the transmission delay of system safety information. However, both LTE and WLAN suffer from packet loss and delay due to switching between access nodes, in addition to the high cost of base stations and the ground core network. Since wireless ad hoc networks do not require a fixed infrastructure, their deployment is more flexible and cheaper. Therefore, ad hoc networks are also a better choice for T2T communication.

Traditional Ad Hoc Network Route Selection
In ad hoc network applications, packet routing is critical for optimizing transmission delay, throughput, and packet loss. Two types of routing methods are commonly used in ad hoc networks: proactive routing and reactive routing. In proactive routing, the OLSR protocol [12] stores the information of each relay node in a routing table by sending HELLO packets in advance and then selects the shortest path from the routing table. The DSDV protocol [13] uses the Bellman-Ford algorithm to select relay nodes from the routing table. Although these approaches allow the optimal route to be selected, they require a large amount of information to be exchanged among all nodes. Especially when the nodes are dynamic, this leads to a rapid increase in the amount of information exchanged. However, in the CBTC system, the trains move rapidly and the transmission latency requirements are strict; thus, reactive routing is more suitable.
In reactive routing, nodes cannot know the global information of the whole network and can only make decisions for selecting the next-hop node. In [22], the GPSR protocol was proposed, in which the node closest to the destination within the communication range is selected as the next-hop node. This minimizes the number of hops in the entire link, thus reducing latency. In order to solve the high outage probability caused by the high-

System Model
As shown in Figure 1, we consider T2T communication over a multi-hop wireless ad hoc network. In this scenario, since the coverage of one hop is very limited, the train needs the assistance of wayside relays for multi-hop transmission. There are multiple relays within the communication range of each train and wayside node; hence, they need to select the most suitable next-hop wayside relay among these candidate nodes. For example, R_2 may communicate with R_3, R_4, and R_5, but R_3 is chosen as the next-hop node by considering factors such as channel quality and delay.
Therefore, for N trains running on the rail, denoted as T = {T_1, T_2, . . . , T_n, . . . , T_N}, there are M wayside relays distributed beside the rail, denoted as R = {R_1, R_2, . . . , R_m, . . . , R_M}. The train has high mobility; in order to ensure the quality of the T2W transmission, we assume that there are two orthogonal frequency bands available: band 1 for train-to-wayside (T2W) transmission [28] and band 2 for wireless wayside-to-wayside (W2W) transmission. Since the transmissions occur on two orthogonal channels, there is no interference between T2W and W2W transmissions, while multiple simultaneous W2W transmissions will interfere with one another.
In multi-hop transmission, all relays follow the decode-and-forward (DF) principle. Furthermore, we assume that the whole system is stationary within time slot t and that the transmit power of the nodes does not change. All channels follow quasi-static Rayleigh fading, such that the channel gain between node a and node b can be represented as

h_{a,b} = X_{a,b} d_{a,b}^{-β},

where h_{a,b} is the instantaneous channel gain of link a → b, X_{a,b} is the fading coefficient, and d_{a,b} and β denote the distance between the two nodes and the path-loss exponent, respectively [29].
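As a concrete illustration, the channel-gain model above can be sketched in Python. This is a hypothetical helper, not code from the paper; modeling the Rayleigh fading power X_{a,b} as a unit-mean exponential random variable is an assumption.

```python
import random

def channel_gain(d_ab, beta, x_ab=None, rng=random):
    """Quasi-static Rayleigh-fading channel gain h_ab = X_ab * d_ab^(-beta).

    x_ab is the fading power coefficient; if omitted, it is drawn from a
    unit-mean exponential distribution (the power of a Rayleigh envelope).
    d_ab is the node distance and beta the path-loss exponent.
    """
    if x_ab is None:
        x_ab = rng.expovariate(1.0)  # unit-mean exponential fading power
    return x_ab * d_ab ** (-beta)
```

Passing `x_ab` explicitly gives the deterministic path-loss-only gain, which is convenient for reproducing average-SNR calculations.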

Communication Model
The T2W link is transmitted on an independent channel; thus, there is no interference from other links, and the signal-to-noise ratio (SNR) of the T2W transmission is

γ_{T_n,R_m} = p_{T_n} h_{T_n,R_m} / N_0,

where p_{T_n} is the transmission power of train T_n, h_{T_n,R_m} is the channel gain between train T_n and wayside relay R_m, and N_0 is the noise power. Hence, the channel throughput [15] between train T_n and wayside relay R_m is

C_{T_n,R_m} = B log_2(1 + γ_{T_n,R_m}),

where B is the channel bandwidth. As for the final hop, wayside node R_m transmits to the destination train T_{n′}, which is calculated in the same way as above, except that the transmission direction is reversed. The throughput between R_m and the destination train T_{n′} can be expressed as

C_{R_m,T_{n′}} = B log_2(1 + γ_{R_m,T_{n′}}).
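The interference-free SNR and the Shannon throughput of the T2W link can be sketched as follows (hypothetical helpers; the explicit bandwidth argument is an assumption of this sketch):

```python
import math

def t2w_snr(p_tx, h, n0):
    """SNR of the interference-free T2W link: gamma = p * h / N0."""
    return p_tx * h / n0

def throughput(bandwidth_hz, snr):
    """Shannon channel throughput C = B * log2(1 + SNR), in bit/s."""
    return bandwidth_hz * math.log2(1.0 + snr)
```

The final hop from R_m to the destination train uses the same two functions with the roles of transmitter and receiver swapped.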

Wayside-to-Wayside (W2W) Link
At time slot t, more than one wireless W2W link may transmit information simultaneously; thus, there is interference between W2W links [16]. During packet transmission, the one-hop link from wayside relay i to wayside relay j at time slot t is denoted as l_{i,j}(t), where i, j ∈ {R_1, R_2, . . . , R_m, . . . , R_M}. Moreover, ρ_{i,j}(t) represents the transmission status of link i → j: ρ_{i,j}(t) = 1 denotes that wayside node i is transmitting to node j. Therefore, the SNR for the transmission from wayside node i to wayside node j can be represented as

γ_{i,j}(t) = p_i h_{i,j} / ( Σ_{l_{i′,j′}(t)} ρ_{i′,j′}(t) p_{i′} h_{i′,j} + N_0 ),

where l_{i′,j′}(t) denotes an interfering link active during time slot t. The transmission throughput between wayside relay i and wayside relay j is

C_{i,j}(t) = B log_2(1 + γ_{i,j}(t)).
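The W2W link quality differs from the T2W case only in the interference term in the denominator. A minimal sketch, assuming the interferers are given as (power, gain-to-receiver) pairs for the concurrently active links:

```python
def w2w_sinr(p_i, h_ij, interferers, n0):
    """SINR of W2W link i -> j: p_i * h_ij / (sum of interference + N0).

    interferers: iterable of (p, h) pairs, one per concurrently active
    W2W transmitter (rho = 1) as heard at receiver j.
    """
    interference = sum(p * h for p, h in interferers)
    return p_i * h_ij / (interference + n0)
```

With an empty interferer list the expression reduces to the interference-free T2W SNR, as expected.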

Outage Analysis
In wireless networks, outage events occur when the actual mutual information is less than the required data rate [30]. To ensure the reliability of information transmission, the SNR of the channel must be greater than the SNR threshold γ_th for transmission to take place. At time slot t, the transmission condition for train T_n with wayside node R_m is γ_{T_n,R_m} > γ_th, and the transmission condition for wayside nodes i and j is γ_{i,j} > γ_th. In particular, the maximum transmission distance R_max between the train and a wayside relay while train T_n is moving follows from the threshold condition

p_{T_n} X_{T_n,R_m} d^{-β} / N_0 > γ_th,

from which we can obtain the maximum distance R_max as

R_max = ( p_{T_n} X_{T_n,R_m} / (γ_th N_0) )^{1/β}.
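Inverting the threshold condition for distance gives the maximum range directly. A sketch under the stated channel model (hypothetical helper; the default unit fading coefficient corresponds to the path-loss-only case):

```python
def max_range(p_tx, gamma_th, n0, beta, x_ab=1.0):
    """Largest distance at which p * X * d^(-beta) / N0 still exceeds
    the SNR threshold gamma_th: R_max = (p * X / (gamma_th * N0))^(1/beta)."""
    return (p_tx * x_ab / (gamma_th * n0)) ** (1.0 / beta)
```
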

Mobile Reliable Model
Due to the mobility of trains, train T_n may move out of the communication range of wayside relay R_m, resulting in an outage event; hence, the distance between the train and wayside nodes should be limited. The candidate wayside node locations are denoted as (x_{R_1}, y_{R_1}), (x_{R_2}, y_{R_2}), . . . , (x_{R_m}, y_{R_m}), . . . , (x_{R_M}, y_{R_M}). During the transmission delay T_c between the train and a wayside node, the channel SNR must satisfy the SNR threshold condition both at the train's initial position (x_{T_1}, y_{T_1}) and at its position at the end of the transmission (x′_{T_1}, y′_{T_1}). Meanwhile, packets propagate through the channel much faster than the train moves; thus, we assume that the train travels at its initial speed v(t) during the transmission. During the transmission delay T_c, the distances moved by the train in the x- and y-directions can be calculated as follows [15]:

S_x = v(t) e_x(t) T_c,  S_y = v(t) e_y(t) T_c,

where e_x(t) and e_y(t) are the components of the train's direction of travel in the x- and y-directions. Therefore, the location where the train ends its transmission is (x′_{T_1}, y′_{T_1}) = (x_{T_1} + S_x, y_{T_1} + S_y). The condition that a candidate wayside node must satisfy is that its distance to the train does not exceed R_max at both (x_{T_1}, y_{T_1}) and (x′_{T_1}, y′_{T_1}); that is, a node can only be a candidate transmission node for a train if its location is within the transmission range of the train at both the beginning and the end of the transmission.
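The two-position range check described above can be sketched as follows (hypothetical helper; it assumes the displacement is v · T_c along a unit direction vector (e_x, e_y), per the model):

```python
import math

def is_candidate(node_xy, start_xy, v, e_xy, t_c, r_max):
    """A wayside node qualifies only if it lies within R_max of the train
    both at the start of transmission and at the end position, where the
    train has moved S = v * T_c along the unit direction (e_x, e_y)."""
    end_xy = (start_xy[0] + v * e_xy[0] * t_c,
              start_xy[1] + v * e_xy[1] * t_c)
    for pos in (start_xy, end_xy):
        if math.dist(node_xy, pos) > r_max:
            return False
    return True
```
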

Delay Model
During packet transmission, a time delay is generated. In this section, we build a delay model to calculate the transmission delay between nodes. We decompose the total delay D_{i,j} of packet transmission between wayside nodes i and j into two main components: the transmission delay T_t caused by the node sending the packet and the queuing delay T_q due to node congestion. In this subsection, the delay calculation is the same for both T2T and T2W transmission; thus, both are expressed as the transmission between nodes a and b.
When a node sends data packets, the transmission delay can be represented as

T_t = L / C_{a,b},

where L is the number of bits in the packet, and C_{a,b} is the transmission rate of the channel between node a and node b.
Queuing delay [27] is unavoidable when transmitting large amounts of data; therefore, it is crucial to build a node queuing model. The queue follows the first-in first-out (FIFO) rule. When the CBTC system is stable, we assume that each wayside node can receive multiple data streams simultaneously (eliminating scheduling effects) and that the average arrival rate and queuing behavior are essentially fixed. To calculate the queuing delay, we use Little's formula.
In Little's law, the average waiting time of a queue can be calculated as the queue length divided by the effective throughput. Since the buffer length in our model is limited, we calculate the effective throughput by considering the packet loss rate and the packet error rate. According to Little's law [31,32], the packet delay at next-hop node b can be expressed as

T_q = Q_b / Th_b,

where Q_b is the average number of packets queued at node b, and Th_b is the effective throughput of node b, given by

Th_b = λ_b (1 − p_f),

where λ_b is the average arrival rate of node b, λ_b ∆t is the total number of packets arriving in a time slot ∆t, and p_f is the unsuccessful transmission rate of link a → b. Several factors affect the unsuccessful transmission rate, notably the packet error rate p_fe and the packet loss rate p_fl. If a packet is lost or a transmission error occurs, the packet becomes unusable; hence, the unsuccessful transmission rate can be expressed as

p_f = 1 − (1 − p_fe)(1 − p_fl).

For the packet error rate, we assume that the channel uses quadrature phase shift keying (QPSK) modulation. Therefore, the bit error rate (BER) [2] is

p_be = Q(√(2 γ_{a,b})),

where γ_{a,b} is the SNR between node a and node b, and Q(·) is the Gaussian Q-function; when the SNR between the two nodes is large enough, p_be ≈ 0. Meanwhile, as the packet length is L bits, the error rate of the whole packet can be represented as

p_fe = 1 − (1 − p_be)^L.

For the packet loss rate p_fl, we build a node queuing model [33,34]. Packet loss is due to the limited buffer length of the node: if the total packet length exceeds the buffer length, the packet is not received by the node, increasing the packet loss rate. We define M as the maximum number of packets that the buffer can hold, Q_{t−1} as the number of packets left over from the previous time slot, and A_t as the average number of packets arriving at the node in time slot t.
We can derive the average number of packets arriving per time slot as A_t = λ_b ∆t, where λ_b is the arrival rate of node b. The amount of packet overflow is F_t, which can be expressed as

F_t = max(Q_{t−1} + A_t − M, 0).

When the train or wayside relay steadily sends packets to the next hop, A_t and F_t remain constant during transmission. Therefore, the packet loss rate of this node can be calculated as

p_fl = E{F_t} / E{A_t},

where E{x} is the mathematical expectation. If there is no packet overflow, the packet loss rate is zero; otherwise, it is the number of lost packets divided by the number of packets arriving at the node. The packet loss rate and the packet error rate calculated above are then substituted into the effective throughput and queuing delay expressions. Therefore, the total delay of the k-th hop from node a to node b can be expressed as

D^k_{a,b} = T_t + T_q.
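The queuing-delay pieces above fit together in a few lines. The following is a minimal sketch with hypothetical helper names, assuming a single steady-state time slot (so expectations reduce to the slot values):

```python
def unsuccessful_rate(p_fe, p_fl):
    """A packet is unusable if it is lost OR corrupted:
    p_f = 1 - (1 - p_fe) * (1 - p_fl)."""
    return 1.0 - (1.0 - p_fe) * (1.0 - p_fl)

def packet_loss_rate(q_prev, a_t, buffer_max):
    """Buffer overflow F_t = max(Q_{t-1} + A_t - M, 0);
    loss rate = F_t / A_t for a steady arrival process."""
    overflow = max(q_prev + a_t - buffer_max, 0)
    return overflow / a_t

def queuing_delay(q_b, lam_b, p_f):
    """Little's law: T_q = Q_b / Th_b, with effective throughput
    Th_b = lambda_b * (1 - p_f)."""
    return q_b / (lam_b * (1.0 - p_f))
```

For example, a node holding 8 leftover packets that receives 4 more against a 10-packet buffer drops 2 of the 4 arrivals, a loss rate of 0.5.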

Hop Tradeoff
In the next-hop selection, if we only pursue large throughput and small delay for one hop, the hop count of the entire T2T link will increase. Therefore, we design a "hop tradeoff" indicator to optimize the number of hops on the entire link. The initial train T_n and the destination train T_{n′} need to exchange information, and the distance of the whole T2T link is d_{T_n,T_{n′}}. During the transmission process, the one-hop distance between node a and node b is S_{a,b}. We estimate the number of hops k_{a,b} required to complete the entire T2T link if each hop covered the one-hop distance S_{a,b}, which can be represented as

k_{a,b} = ⌈ d_{T_n,T_{n′}} / S_{a,b} ⌉.
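The hop-tradeoff indicator is just this ratio rounded up; a short hop yields a large k and is therefore penalized in the reward. A sketch (hypothetical helper; the ceiling form is an assumption consistent with the definition above):

```python
import math

def hop_tradeoff(d_total, s_ab):
    """Estimated number of hops needed to cover the whole T2T distance
    if every hop advanced as far as this one: k_ab = ceil(d / S_ab)."""
    return math.ceil(d_total / s_ab)
```
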

Problem Formulation
In the CBTC scenario, different numbers of wayside nodes assist information transmission depending on the distance between two trains; thus, the link selection for transmission is particularly important. To solve the problem of multi-hop relay selection in wireless ad hoc networks, we propose an optimal transmission model based on a discrete Markov process. The aim is to design a relay node selection policy that satisfies low latency and high throughput, so that information can be transmitted quickly and accurately between trains.
Since the next-hop selection depends only on the current state, and the next state changes with the current action selection, next-hop selection can be modeled as a Markov decision process (MDP). The transition probabilities between states in the MDP are unknown; thus, we can use the DRL approach to better solve the proposed problem. In DRL, the agent finds the optimal policy that maximizes the long-term reward value according to the channel state information (CSI) and the congestion level of the node. In this paper, we use multiagent DRL (MADRL), in which each node acts as an agent. As agents need to make decisions in a shorter period of time, the agents' network is trained offline centrally by collecting information between nodes and is then ported to each agent for online decision making. Therefore, each agent selects the next-hop node independently according to the current state, without additional communication for further training. In DRL, there are several key components, as described below.
(1) State Space In each time slot, the agent updates and learns the policy by observing the variation of the state. In particular, the state contains the number of packets queued at each node, the channel throughput of the links between node pairs, and a transmitter indicator. In time slot t, the state space is defined as

s(t) = {Q(t), C(t), V(t)},

where Q(t) indicates the queue length of each node, C(t) is the throughput between the transmitting node and the other nodes, and V(t) ∈ {0, 1}: V(t) = 0 means that a wayside node sends packets at time slot t, whereas V(t) = 1 means that the train sends the packet.
(2) Action Space According to the channel state and queue state of each candidate node, the optimal next-hop node A(t) is selected. The action space can be given by

A(t) ∈ {R_1, R_2, . . . , R_m, . . . , R_M},

where A(t) = R_m means that the next hop is node R_m.

(3) Reward Function
In the selection of the next-hop node, the optimization objective is to minimize the delay and maximize the throughput of the entire T2T link while ensuring that the next-hop SNR is greater than the threshold value. The packet is successfully transmitted from the initial train to the destination train after k ∈ {1, 2, . . . , K} hops through wayside nodes i, j ∈ {R_1, R_2, . . . , R_m, . . . , R_M}. Furthermore, T_n is the packet source train, and T_{n′} is the packet destination train.

• Overall Transmission Time of Whole Link
The total transmission time for a packet transmitted between source train T_n and destination train T_{n′} is the sum of the per-hop delays over all K hops:

τ_{T_n,T_{n′}} = Σ_{k=1}^{K} D^k_{i,j}.

• Throughput of Whole Link
In order to better measure the quality of each hop in the transmission link, and considering the packet loss rate and packet error rate, the throughput of the entire link is defined as follows [35,36]:

C_{T_n,T_{n′}} = (1 − p_f) min_{l_{i,j}} C_{i,j},

where p_f is the unsuccessful transmission rate of the whole link (the one-hop unsuccessful rate is calculated as in the delay model), and C_{i,j} denotes the throughput between wayside nodes i and j.

• The Optimization Goal
The optimization goal for the proposed MADDQN is to reduce the latency and improve the throughput of the whole link; hence, the proposed optimization goal is defined as

max ω_1 (1/τ_{T_n,T_{n′}}) + ω_2 C_{T_n,T_{n′}},
s.t. C1: γ_{i,j} > γ_th;  C2: each hop delay is less than the target transmission time;  C3: d_{T_n,i} ≤ R_max.

Here, τ_{T_n,T_{n′}} and C_{T_n,T_{n′}} are the whole-link latency and throughput between train T_n and train T_{n′}, respectively, and ω_1 and ω_2 are the weight factors of the delay and throughput (ω_1 + ω_2 = 1). C1 ensures that the SNR of the channel is greater than the threshold value. C2 requires each hop delay to be less than the target transmission time; if a one-hop delay takes too long, the transmission is considered to have failed. C3 requires that, when the train transmits to wayside node i, the distance between them is less than the maximum transmission distance. The reward function comprehensively considers the throughput and delay of each hop, together with the indicator k_{i,j} from Section 3.2.3, an additional reward r_s for the final hop directly to the train, and an outage penalty r_c. The reward r_s prevents other wayside nodes close to the train from being selected, which could increase the hop count and delay; r_c penalizes the selection of a next-hop node that is out of communication range or a single-hop delay that is too long under C1–C3.
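As an illustration, the weighted delay-throughput objective can be sketched in Python. The exact functional form ω_1 (1/τ) + ω_2 C is a reconstruction from the weight definitions, and the helper name is hypothetical:

```python
def link_objective(tau, c, w1=0.5, w2=0.5):
    """Weighted whole-link objective combining latency and throughput,
    assuming the form w1 * (1/tau) + w2 * C with w1 + w2 = 1.

    tau: whole-link delay; c: whole-link throughput;
    w1, w2: delay/throughput weight factors.
    """
    assert abs(w1 + w2 - 1.0) < 1e-9, "weights must sum to one"
    return w1 / tau + w2 * c
```

Smaller delays and larger throughputs both increase the objective, which is the quantity the relay-selection policy seeks to maximize subject to C1–C3.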

Problem Solution
Since value-based methods are suitable for solving discrete-space problems, and next-hop relay selection is a discrete action, we choose a value-based reinforcement learning approach for policy optimization. Our proposed scheme has a large number of channel states and queuing states, which leads to a high-dimensional Q-table and makes the Q-learning [37] algorithm difficult to converge during training. However, the DQN algorithm can solve this problem by combining a deep neural network (DNN) with the Q-value. DQN does not directly select the action with the highest Q-value from a Q-table, but fits Q^π(s_t, a_t, θ) through the neural network [38]. Compared with recording the Q-value for every action-state pair, DQN only needs to store the weights of each neuron to calculate the Q-values for all policies π(s_t, a_t), which greatly reduces the storage space and makes the algorithm converge faster [39,40].
Since each node needs to make the decision to select the next-hop node, we use the multiagent dueling DQN (MADDQN) approach. MADDQN treats each node as an independent agent, and all the agents are trained centrally. When making the next-hop selection, the trained network parameters are shared with all nodes, and each node selects the next-hop node individually. The specific process of MADDQN is shown in Figure 2.

DQN
In each time slot t, each node acts as an agent and observes the current state s_t of the system, including the congestion level and channel state information of all nodes. The agent then chooses a suitable action a_t to select the next-hop node. After selecting action a_t, a new state s_{t+1} is obtained, and the reward r_t corresponding to this action is computed. The goal of the agent is to find a policy π(s_t, a_t) that maximizes the expected discounted cumulative reward [9]. Therefore, the action-state value function is used to calculate the expected discounted cumulative reward of each relay selection policy, and the policy π with the largest reward is selected. The state-action value function is defined as

Q^π(s_t, a_t) = E_π{ Σ_{i=0}^{∞} γ^i r_{t+i+1} | s_t, a_t },

where γ ∈ (0, 1) is the discount factor, which represents the tradeoff between immediate and long-term reward, and r_{t+i+1} is the immediate reward at time slot t + i + 1. In the next time slot of action selection, the next-hop relay is not only selected with maximum Q^π(s_t, a_t); the ε-greedy algorithm is also used to explore additional actions. In order to try more possible actions and avoid falling into a local maximum, the agent chooses an action at random with probability ε. The ε-greedy policy is denoted as

a_t = argmax_a Q^π(s_t, a), with probability 1 − ε;  a random action, with probability ε.

Due to the different channel states and node congestion levels, a large number of states are formed; thus, it is impossible to calculate the Q-value for each action and state. In DQN, a convolutional neural network (CNN) is trained to obtain the weights θ of each neuron. After inputting the current channel state and the next-hop relay selection action, the neural network can fit a state-action value Q(s_t, a_t, θ). To make the network converge faster, DQN uses two networks: the target network and the evaluation network. During the training process, the weights θ of the evaluation network are continuously updated, and they are assigned to the target network θ_tar at certain time intervals.
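The ε-greedy rule described above is straightforward to implement. A minimal sketch (hypothetical helper; actions are indices into a list of candidate-relay Q-values):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return argmax_a Q(s, a) with probability 1 - epsilon,
    or a uniformly random action index with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting ε = 0 recovers pure greedy selection; training typically keeps a small ε (0.1 in the simulations below) to maintain exploration.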
Then, the weights θ are updated by stochastic gradient descent to minimize the loss between the target network and the evaluation network. The loss function is defined as

L(θ) = E{ (y_t − Q(s_t, a_t, θ))^2 },

where the target value for each iteration of the network is

y_t = r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, θ_tar).

To make better use of historical experience and to break the correlation between consecutive experiences, DQN also uses the mechanism of experience replay. At each time slot during centralized training, each node, acting as an agent, stores the training experience e_t = (s_t, a_t, r_t, s_{t+1}) in the experience pool, forming the sequence D = {e_1, e_2, . . . , e_N}. For each training step, a small batch of samples is drawn at random from the experience pool for network training, which helps the network converge.
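The replay buffer and the bootstrapped target can be sketched without any deep learning framework (hypothetical helpers; the target-network Q-values are passed in as a plain list):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience pool; uniform sampling breaks the temporal
    correlation between consecutive transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def td_target(r, q_next_target, gamma, terminal):
    """y = r for a terminal transition (final hop reaches the train),
    else y = r + gamma * max_a' Q_target(s', a')."""
    if terminal:
        return r
    return r + gamma * max(q_next_target)
```

The evaluation network would then be regressed toward `td_target` outputs, with its weights copied to the target network every few steps.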

Dueling DQN
The dueling DQN network [19] further improves the DQN network structure. In DQN, the network directly outputs the state-action value Q(s_t, a_t, θ) corresponding to each relay selection policy. In dueling DQN, however, the output is split into two branches: the state value V(s_t), indicating the value of the current channel and queuing state, and the action advantage value A(s_t, a_t), representing the value brought by the relay selection action. The output values of the two branches are then combined to make the estimate of Q more accurate. The combination of the two branches can be written as

Q(s_t, a_t; θ, σ, ϑ) = V(s_t; θ, ϑ) + A(s_t, a_t; θ, σ),

where θ, σ, and ϑ are the coefficients of the neural network. In order to prevent multiple combinations of the state value V(s_t) and the action advantage value A(s_t, a_t) from yielding the same state-action value Q(s_t, a_t; θ, σ, ϑ), and to make the algorithm more stable [38], the sum above is replaced by

Q(s_t, a_t; θ, σ, ϑ) = V(s_t; θ, ϑ) + A(s_t, a_t; θ, σ) − (1/|A|) Σ_{a_{t+1}} A(s_t, a_{t+1}; θ, σ).
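The mean-subtracted aggregation of the two branches is a one-liner. A sketch (hypothetical helper operating on already-computed branch outputs rather than network layers):

```python
def dueling_q(v, advantages):
    """Combine the dueling branches: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').

    Subtracting the mean advantage makes the V/A decomposition identifiable:
    adding a constant to V and subtracting it from every A no longer yields
    the same Q-values.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [v + a - mean_adv for a in advantages]
```

Note that the best action is unchanged by the mean subtraction; only the scale of the two branches is pinned down.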
The proposed multiagent dueling DQN is shown in Algorithm 1.

Algorithm 1: Multiagent dueling DQN for next-hop relay selection
1: Initialize network memory size J, batch size B, greedy coefficient ε, and learning rate ϕ.
2: for episode in range K do:
3:   Reset the channel quality C and the queue length Q of each node as the initial state S_initial.
4:   while a(t) != Destination Node do
5:     Choose action: with probability ε, choose the next-hop node at random;
6:     otherwise, choose the action with the maximum Q-value.
7:     From the current state s_t and action a_t of this hop, obtain the reward r_t for this action a_t and the next state s_{t+1}.
8:     Store (s_t, a_t, r_t, s_{t+1}) into the experience replay memory.
9:     Randomly take a minibatch of (s_t, a_t, r_t, s_{t+1}) from the experience replay memory.
10:    Combine the two branches V(s_t; θ, ϑ) and A(s_t, a_t; θ, σ) into Q(s_t, a_t; θ, σ, ϑ).
11:    Calculate the target Q-value: y_t = r_t if a_t is the destination node; y_t = r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ_tar, σ, ϑ) otherwise.
12:    Minimize the loss function L(θ).
13:    Update the target network after several steps using the parameters of the evaluation network.
14:  end while
15: end for

Simulation Results
In this section, we verify the effectiveness of the proposed deep reinforcement learning-based relay selection algorithm by conducting simulation experiments in a CBTC system.

Simulation Settings
In the simulation, TensorFlow 1.13.1 with Python 3.6 was used as the simulation environment.
In order to simplify the system model, we simulated relay selection between two adjacent trains. If the SNR between the current node and the next-hop wayside relay is greater than the SNR threshold, communication between the two nodes is possible. Since each node has the same transmission power, the distance between nodes mainly determines the channel throughput; thus, a next-hop node closer to the current node has higher one-hop channel throughput. In the train system, packets transmitted by the train must pass through wayside nodes and cannot be delivered directly to the forward train. In addition, wayside relays are uniformly distributed on both sides of the track; as the distance between trains becomes longer, the number of hops required for transmission also increases.
Furthermore, in the process of training, each agent has its own training parameters; we set the batch size B = 256, greedy coefficient ε = 0.1, learning rate ϕ = 0.001, and memory size J = 1024. Some other main parameters of the communication system are shown in Table 1.

Performance Analysis
We compare the proposed MADDQN algorithm with two existing algorithms:
1. GPSR [22] (greedy perimeter stateless routing) is often used for transmission in ad hoc networks; it collects the geographic location information of neighboring nodes and, through a greedy algorithm, selects the next-hop node geographically closest to the destination.
2. The random selection scheme randomly selects the next-hop node within the communication range without any optimization strategy.
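The two baselines can be sketched as follows; this shows only GPSR's greedy forwarding mode (its perimeter/recovery mode is omitted), and the coordinates are illustrative:

```python
import math
import random

def gpsr_next_hop(candidates, destination):
    """GPSR greedy mode: forward to the in-range neighbor
    geographically closest to the destination."""
    return min(candidates, key=lambda n: math.dist(n, destination))

def random_next_hop(candidates, rng=random):
    """Random baseline: pick any in-range neighbor uniformly."""
    return rng.choice(candidates)

# Example: the destination train is far down the track at (1000, 0).
dest = (1000, 0)
neighbors = [(100, 10), (250, -10), (180, 0)]
hop = gpsr_next_hop(neighbors, dest)
```

Note that neither baseline inspects queue lengths, which is why both can forward into congested nodes, unlike the learned policy.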

Performance Comparison of Convergence
Firstly, in order to find the learning rate that makes the proposed model converge best, we conducted experiments at three different learning rates. As shown in Figure 3, when the learning rate was 10⁻⁴, the agent converged slowly, and the total reward did not reach the optimal value. To speed up convergence, we increased the learning rate to 10⁻³, which made convergence faster and the total reward higher. When the learning rate was further increased to 10⁻², the agent easily converged to a local optimum, resulting in poor convergence results. Therefore, in the training of the agents, the learning rate was set to 10⁻³.
The goal of agent training is to better avoid outage events and reduce the packet loss rate. As shown in Figure 4, the probability of outage events decreased dramatically in the first 1000 episodes, indicating that the agent learned to select next-hop relays within communication range. At the same time, the probability of network congestion gradually decreased during the training process, which illustrates that the agent successfully avoided congested nodes. The simulation results show that the MADDQN algorithm could effectively avoid outage and congestion events, ensuring the quality of transmission.

Performance Comparison of Different Aspects
The distance between two adjacent trains varies, and the wayside nodes are uniformly distributed. Thus, when the distance between two trains becomes larger, the number of trackside nodes between them increases and the topology of the network changes. Figure 5 depicts the variation of the total delay as the number of nodes increases. It can be observed that the whole-link delay increased from 4.31 ms to 8.43 ms under the MADDQN algorithm. The main reason is that, with the increase in the number of relays, the number of hops required for the whole link increased; thus, the total delay increased. Compared with the random selection scheme and the GPSR algorithm, the transmission delay of the MADDQN algorithm was reduced by an average of 2 ms and 0.5 ms, respectively. Although the traditional GPSR algorithm requires a small number of hops, it cannot avoid congested nodes, resulting in a large packet loss rate. In addition, the random scheme can neither select nodes with few queued tasks nor optimize the hop count; hence, its delay is longer than that of the previous two algorithms.

The effect of buffer size on total delay is investigated in Figure 6. When the buffer size was less than 300 kb, the total latency increased significantly as the buffer size became larger. When the buffer size was small, the packet queue was shorter, resulting in lower packet delay. Moreover, when the buffer size reached 300 kb, the total delay gradually flattened out. The reason is that nodes no longer dropped packets, and the queue length of each node tended to stabilize. In addition, MADDQN improved by 0.5 ms compared to the GPSR algorithm and by 3 ms compared to the random selection scheme, which illustrates that the proposed MADDQN algorithm could select nodes with shorter queues for transmission.

Figure 7 presents the relationship between the number of nodes and the average loss rate under different schemes. As the number of nodes increased from four to nine, the packet loss rate of MADDQN increased to 0.12. The reason is that, as the number of hops increased, the total packet loss rate also increased. Compared with the other two schemes, the proposed MADDQN could select the next-hop relay with a shorter queue for transmission, which greatly reduced the packet loss probability. While the GPSR algorithm could not avoid congested nodes, it required fewer hops for transmission; hence, its packet loss rate was also lower than that of the random scheme.

Figure 8 depicts the change in the average loss rate with the buffer size. The average packet loss rate rapidly decreased to zero when the buffer size reached 300 kb. With a small buffer, packets could easily overflow; hence, the packet loss rate continued to decrease until the buffer was large enough. Before 300 kb, the average loss rates of the GPSR and random schemes were higher than that of MADDQN, which illustrates that our proposed method had a significant effect in avoiding congested nodes and reducing the number of hops, such that its packet loss rate was the lowest.

Figure 9 presents how the number of nodes affects the average throughput.
It can be observed that the average throughput of the entire link decreased monotonically with the increase in the number of nodes. This is because, as shown in Figure 7, the packet loss rate increased with the number of nodes, which led to a reduction in the overall throughput. The throughput of MADDQN was greater than that of the other two methods, indicating that, when selecting the next-hop node, MADDQN chose the node with relatively large channel throughput and fewer queued packets, ensuring channel quality.

Figure 10 shows the impact of buffer size on the throughput. It can be observed that the throughput rapidly increased before 300 kb and then reached a stable state with increasing buffer size. This is because, as the buffer size gradually increased, the packet loss rate decreased; therefore, the system throughput increased. When the system had no packet loss, the throughput tended to stabilize. Meanwhile, the proposed algorithm had the highest throughput compared with the other two schemes; hence, it can be concluded that MADDQN was effective in selecting the route with a large throughput.

Figure 11 illustrates the relationship between the number of nodes and the optimization goal under different weights of latency and channel throughput. The simulation results show that, with the rise in ω1, the optimization goal became much larger. This is because, although the objective optimizes both delay and throughput, when the weight of delay was high, latency was optimized more, while the optimization of throughput was relatively weak. Moreover, when the number of nodes was between four and six, delay accounted for a large proportion of the optimization objective. Thus, when ω1 = 0.9, the total optimization objective was the highest. However, in order to optimize both delay and throughput to a better level, we chose the case ω1 = 0.5, ω2 = 0.5. In applications, different parameters can also be chosen according to different needs. For example, the weight ω1 can be increased for safety information with stricter latency requirements. For systems where throughput is more important, ω2 can be increased appropriately, but the sum of ω1 and ω2 must equal 1.
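The weighted trade-off described above can be sketched as follows. The paper's exact Equation (27) is not reproduced here; the linear form and the normalization constants are assumptions made purely for illustration:

```python
def optimization_goal(delay_ms, throughput, w1=0.5, w2=0.5,
                      delay_ref=10.0, tput_ref=100.0):
    """Illustrative weighted objective: lower whole-link delay and
    higher throughput both increase the goal. delay_ref and tput_ref
    are assumed normalization constants; the weights must sum to 1."""
    assert abs(w1 + w2 - 1.0) < 1e-9, "w1 + w2 must equal 1"
    return w1 * (1.0 - delay_ms / delay_ref) + w2 * (throughput / tput_ref)

# Balanced weighting, as chosen in the paper (ω1 = ω2 = 0.5).
balanced = optimization_goal(4.0, 80.0)
# Latency-critical safety messages would raise w1 instead.
latency_first = optimization_goal(4.0, 80.0, w1=0.9, w2=0.1)
```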
The effect of the number of nodes and the buffer size on the optimization goal is investigated in Figures 12 and 13. The optimization goal was derived from Equation (27), which is a comprehensive indicator of channel throughput and transmission delay. In Figure 12, the delay increased and the throughput decreased as the number of nodes grew; thus, the optimization objective gradually decreased. This shows that, as the number of nodes increased, the overall performance of the system worsened. In Figure 13, both latency and throughput tended to rise as the buffer size increased, but latency rose faster, having a greater impact on the optimization objective. Therefore, the optimization goal showed a slight decrease after combining these two indicators. Moreover, we can observe that our proposed algorithm always outperformed the existing algorithms, indicating that the MADDQN algorithm could better trade off channel throughput and transmission delay under any topological condition and buffer size to achieve the optimization goal.

Conclusions and Future Work
In this paper, we designed a multi-hop relay selection strategy based on wireless ad hoc networks to assist T2T communication. The optimization goal of our proposed algorithm is to reduce the T2T transmission delay and increase the throughput of the entire link in a congested network. Since the channel status changes in real time, an MADDQN approach was proposed to better solve the problem. Simulation results showed that our proposed algorithm could effectively avoid congested nodes and reduce the number of hops for the whole-link transmission, thereby better achieving the optimization goal compared with existing routing algorithms. In future work, the energy consumption of the nodes and the problem of retransmission after packet loss should be considered. Moreover, some secure and energy-efficient technologies, such as reconfigurable intelligent surfaces (RIS), will be applied to the CBTC system to better assist in signal transmission.
