1. Introduction
Recently, with the rapid development of urbanization, urban rail transit has become one of the main modes of public transportation. Alongside this growth, the communication-based train control (CBTC) system has come to play an important role in guaranteeing the safe operation of rail trains [1]. To ensure safety and reliability, CBTC systems impose strict requirements on transmission delay and channel quality [2]. Long communication delays and link interruptions may lead to emergency braking or collisions [3]. Therefore, it is crucial to design a CBTC communication system with low latency and high channel quality.
In traditional CBTC systems, long-term evolution for metro (LTE-M) and wireless local area networks (WLANs) are widely used for train-to-wayside communication [4]. Train information is first transmitted to the ground zone controller (G-ZC), which generates control commands for all trains in its management area [5]. After obtaining the commands, the wayside node sends them back to the trains. However, due to the heavy computational burden on the G-ZC and the indirect transmission link [6], the transmission delay of important control commands becomes excessively large. Therefore, a direct train-to-train (T2T) transmission approach was proposed [4], in which the G-ZC is replaced by an onboard zone controller (On-ZC). Unlike the G-ZC, the On-ZC only needs to generate its own commands, which greatly reduces computation latency.
Although direct T2T transmission greatly reduces latency [7], if WLAN or LTE continues to be used, interruptions and delays caused by hard handoff at base station boundaries remain unavoidable. In addition, deploying a terrestrial core network and base stations is complex, which makes network construction and maintenance costly. Therefore, new technologies such as reconfigurable intelligent surfaces [8,9] and wireless ad hoc networks have been proposed to improve T2T communication.
The wireless ad hoc network is a novel approach for improving the performance of CBTC systems. In a wireless ad hoc network, packets are sent from the onboard node to a wayside node. Similar to the vehicular ad hoc network (VANET), the role of the wayside nodes is to assist packet transmission [10]. Packets are then forwarded hop by hop through the transmission network formed by the wayside nodes until they reach the running train, which makes the transmission link more stable. Furthermore, wireless ad hoc network nodes are inexpensive to deploy, and the network has no fixed topology and requires no fixed infrastructure to communicate, allowing it to be deployed flexibly and configured quickly.
The relay selection strategy plays an important role in wireless ad hoc networks to reduce transmission delay, improve throughput, and decrease packet loss. However, since wireless ad hoc networks are rarely used in CBTC systems, no suitable routing strategy has been proposed. Therefore, the existing strategies from VANETs and mobile ad hoc networks (MANETs) should be considered and improved to adapt to the CBTC system.
In VANET and MANET relay selection strategies, routing algorithms are generally divided into two types: proactive routing and reactive routing [11]. Traditional proactive routing approaches, such as optimized link-state routing (OLSR) [12] and destination-sequenced distance vector (DSDV) [13], require significant overhead for node exploration and routing table maintenance, which is not feasible for CBTC systems with strict low-latency requirements. Meanwhile, although the traditional reactive protocol greedy perimeter stateless routing (GPSR) can directly select the next-hop relay, it cannot fully account for factors such as node congestion and channel quality. In a CBTC system, the transmission channel is affected by the high mobility of the train, which causes shadow fading and multipath fading and leads to sudden changes in the transmission link. Therefore, the relay selection strategy must be able to make decisions according to the varying channel state.
On the basis of past training experience, learning-based routing algorithms can make good real-time decisions by observing the current channel state. Meanwhile, in deep reinforcement learning (DRL), a multi-factor optimization problem can be transformed into the maximization of a cumulative reward [14]. Diverse factors can be combined in the reward design in order to jointly optimize these indicators. Therefore, learning-based routing algorithms are well suited to CBTC networks. Existing DRL-based algorithms are often used for relay selection in VANETs and MANETs [15,16]; these algorithms focus mainly on single-hop delay, outage probability, and power consumption as the criteria for next-hop selection. However, they ignore the whole-link delay and throughput.
In this paper, our objective is to design a low-latency, high-throughput routing method in an ad hoc network for a CBTC system. However, packet transmission still faces challenges in multi-hop relay selection. For example, because the train moves at high speed, the transmission distance between the train and a wayside node is limited; in order to decrease the outage probability, we impose strict distance limitations on transmissions involving trains. In addition, we comprehensively consider the transmission delay, queuing delay, and channel quality, aiming to optimize the overall performance of the link by selecting the next-hop node. Moreover, the process of selecting the next-hop node can be formulated as a Markov decision process (MDP) [17,18]. We propose a multi-agent DRL method to solve the problem. The main contributions of this paper are summarized as follows:
We formulate the next-hop relay selection problem in a CBTC system. The goal is to select relay nodes with low transmission delay and high throughput within the communication range of both the train and the wayside nodes. Meanwhile, in order to balance the single-hop transmission delay and the whole-link hop count, we propose the concept of "hop tradeoff" to minimize the latency of the entire link.
To handle the time-varying channel state and node congestion, we propose a DRL algorithm to optimize the long-term system reward. Using a multi-agent approach [14], all nodes are trained centrally with a dueling DQN [19], and each node then makes its next-hop decision individually, so as to avoid nodes with long queuing delays and poor channel quality.
Lastly, we conduct simulations with different numbers of nodes between two trains and different buffer sizes. The proposed algorithm is compared with several existing algorithms in terms of whole-link delay, packet loss rate, and throughput. The simulation results indicate that the proposed scheme performs well in congested networks. In particular, it is also significantly superior to other routing algorithms in terms of whole-link delay, throughput, and packet loss rate.
The remainder of this paper is organized as follows. Section 2 introduces related work on routing selection in ad hoc networks. Section 3 presents the multi-hop relay selection model for ad hoc networks in CBTC systems. The joint optimization problem of channel throughput and whole-link delay is formulated in Section 4. Section 5 introduces the multi-agent deep reinforcement learning method used to solve the formulated problem. Simulation results and analyses are presented in Section 6. Finally, Section 7 concludes the paper and outlines future work.
3. System Model
As shown in Figure 1, we consider T2T communication over a multi-hop wireless ad hoc network. In this scenario, since the coverage of one hop is very limited, the train needs the assistance of wayside relays for multi-hop transmission. There are multiple relays within the communication range of each train and wayside node; hence, each transmitting node needs to select the most suitable next-hop wayside relay among these candidate nodes. For example, a transmitting node may be able to communicate with several nearby wayside relays, but only one of them is chosen as the next-hop node after considering factors such as channel quality and delay.
Therefore, we consider $M$ trains running on the rail, denoted as $\mathcal{T} = \{T_1, T_2, \ldots, T_M\}$, and $N$ wayside relays distributed beside the rail, denoted as $\mathcal{R} = \{R_1, R_2, \ldots, R_N\}$. The train has high mobility; in order to ensure the quality of the train-to-wayside (T2W) transmission, we assume that two orthogonal frequency bands are available: band 1 for T2W transmission [28] and band 2 for wireless wayside-to-wayside (W2W) transmission. Since the transmissions use two orthogonal channels, there is no interference between T2W and W2W transmissions, whereas multiple simultaneous W2W transmissions do interfere with each other.
In multi-hop transmission, all relays follow the decode-and-forward (DF) principle. Furthermore, we assume that the whole system is stationary within a time slot $t$ and that the transmit power of the nodes does not change. All channels follow quasi-static Rayleigh fading, such that the channel gain between node $i$ and node $j$ can be represented as follows:

$$h_{i,j}(t) = g_{i,j}(t)\, d_{i,j}^{-\alpha},$$

where $h_{i,j}(t)$ is the instantaneous channel gain of link $l_{i,j}$, $g_{i,j}(t)$ is the Rayleigh fading coefficient, and $d_{i,j}$ and $\alpha$ indicate the distance between the two nodes and the path-loss exponent, respectively [29].
3.1. Communication Model
3.1.1. Train-to-Wayside (T2W) Link
The T2W link operates on an independent channel; thus, there is no interference from other links, and the signal-to-noise ratio (SNR) of the T2W transmission is

$$\Gamma_{m,n}(t) = \frac{P_m\, h_{m,n}(t)}{\sigma^2},$$

where $P_m$ is the transmission power of train $T_m$, $h_{m,n}(t)$ is the channel gain between train $T_m$ and wayside relay $R_n$, and $\sigma^2$ is the noise power. Hence, the channel throughput [15] between train $T_m$ and wayside relay $R_n$ is

$$C_{m,n}(t) = B \log_2\!\big(1 + \Gamma_{m,n}(t)\big),$$

where $B$ is the channel bandwidth. As for the final hop, wayside node $R_n$ transmits to the destination train $T_d$, and the throughput is calculated in the same way as above, except that the transmission direction is reversed. The throughput between $R_n$ and destination train $T_d$ can be expressed as

$$C_{n,d}(t) = B \log_2\!\left(1 + \frac{P_n\, h_{n,d}(t)}{\sigma^2}\right).$$
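To make the channel model concrete, the short Python sketch below draws a quasi-static Rayleigh fading gain and evaluates the T2W SNR and Shannon throughput; the transmit power, noise power, bandwidth, and path-loss exponent values are illustrative assumptions rather than parameters from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_gain(distance_m, path_loss_exp=3.0):
    """Quasi-static Rayleigh fading gain: |g|^2 * d^(-alpha)."""
    g = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)  # unit-power Rayleigh coefficient
    return np.abs(g) ** 2 * distance_m ** (-path_loss_exp)

def t2w_throughput(p_tx_w, distance_m, noise_w=1e-13, bandwidth_hz=5e6):
    """SNR and Shannon throughput of a single interference-free T2W link."""
    snr = p_tx_w * channel_gain(distance_m) / noise_w
    return snr, bandwidth_hz * np.log2(1.0 + snr)

snr, rate = t2w_throughput(p_tx_w=0.5, distance_m=120.0)
print(f"SNR = {10 * np.log10(snr):.1f} dB, throughput = {rate / 1e6:.2f} Mbit/s")
```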
3.1.2. Wayside-to-Wayside (W2W) Link
At time slot $t$, it is possible that more than one wireless W2W link transmits information simultaneously; thus, there is interference between W2W links [16]. During packet transmission, the one-hop link from wayside relay $R_i$ to wayside relay $R_j$ at time slot $t$ is denoted as $l_{i,j}(t)$, where $i, j \in \{1, 2, \ldots, N\}$ and $i \neq j$. Moreover, $x_{i,j}(t) \in \{0, 1\}$ represents the transmission status of link $l_{i,j}(t)$; $x_{i,j}(t) = 1$ denotes that wayside node $R_i$ is transmitting to node $R_j$. Therefore, the signal-to-interference-plus-noise ratio for the transmission from wayside node $R_i$ to wayside node $R_j$ can be represented as

$$\Gamma_{i,j}(t) = \frac{P_i\, h_{i,j}(t)}{\sigma^2 + \sum_{(u,v) \in \mathcal{I}(t)} x_{u,v}(t)\, P_u\, h_{u,j}(t)},$$

where $\mathcal{I}(t)$ is the set of interfering links during time slot $t$.

The transmission throughput between wayside relay $R_i$ and wayside relay $R_j$ is

$$C_{i,j}(t) = B \log_2\!\big(1 + \Gamma_{i,j}(t)\big).$$
3.1.3. Outage Analysis
In wireless networks, outage events occur when the actual mutual information is less than the required data rate [30]. To ensure the reliability of information transmission, the SNR of the channel must be greater than the SNR threshold value $\Gamma_{th}$ for transmission to take place. At time slot $t$, the transmission condition for train $T_m$ and wayside node $R_n$ is $\Gamma_{m,n}(t) \ge \Gamma_{th}$, and the transmission condition for wayside node $R_i$ and wayside node $R_j$ is $\Gamma_{i,j}(t) \ge \Gamma_{th}$. In particular, the maximum transmission distance $d_{\max}$ between the train and a wayside relay while train $T_m$ is moving can be calculated from

$$\frac{P_m\, g_{m,n}(t)\, d_{\max}^{-\alpha}}{\sigma^2} = \Gamma_{th}.$$

We can obtain the maximum distance $d_{\max}$ as

$$d_{\max} = \left(\frac{P_m\, g_{m,n}(t)}{\Gamma_{th}\, \sigma^2}\right)^{1/\alpha}.$$
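The maximum T2W distance implied by the SNR threshold could be evaluated as in the sketch below; the threshold value and the use of the mean (unit) fading power instead of an instantaneous draw are simplifying assumptions.

```python
def max_t2w_distance(p_tx_w, snr_threshold_db, noise_w=1e-13,
                     path_loss_exp=3.0, fading_power=1.0):
    """Largest distance at which the expected T2W SNR still meets the threshold."""
    snr_threshold = 10 ** (snr_threshold_db / 10)
    return (p_tx_w * fading_power / (snr_threshold * noise_w)) ** (1.0 / path_loss_exp)

print(f"d_max ~ {max_t2w_distance(0.5, snr_threshold_db=10.0):.0f} m")
```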
3.2. Optimal Relay Selection
3.2.1. Mobile Reliable Model
Due to the mobility of trains, train $T_m$ may move out of the communication range of wayside relay $R_n$, resulting in an outage event; hence, the distance between the train and the wayside nodes should be limited. The candidate wayside node locations are denoted as $(x_n, y_n)$. During the transmission delay $\tau$ between the train and the wayside node, the channel SNR must satisfy the SNR threshold condition both when the train is at its initial position $(x_m, y_m)$ and when it is at the position $(x'_m, y'_m)$ where the transmission ends. Meanwhile, since packets propagate over the channel much faster than the train moves, we assume that the train keeps its initial speed $v$ during the transmission. Within the transmission delay $\tau$, the distances the train moves in the $x$- and $y$-directions can be calculated as follows [15]:

$$\Delta x = v_x \tau, \qquad \Delta y = v_y \tau,$$

where $v_x$ and $v_y$ are the $x$- and $y$-direction components of the train's speed. Therefore, the location where the train ends its transmission is

$$(x'_m, y'_m) = (x_m + \Delta x,\; y_m + \Delta y).$$

The conditions that candidate wayside nodes need to satisfy are

$$\sqrt{(x_n - x_m)^2 + (y_n - y_m)^2} \le d_{\max} \quad \text{and} \quad \sqrt{(x_n - x'_m)^2 + (y_n - y'_m)^2} \le d_{\max}.$$

A node can be a candidate transmission node for a train only if its location is within the transmission range of the train at both the beginning and the end of the transmission.
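The candidate-node condition can be checked with a few lines of code; the coordinates, speed, and transmission delay below are illustrative, and max_t2w_distance is the assumed helper from the previous sketch.

```python
import math

def is_candidate(node_xy, train_xy, train_vxy, tau_s, d_max):
    """A wayside node qualifies only if it stays within range of the train
    both at the start and at the end of a transmission of duration tau_s."""
    end_xy = (train_xy[0] + train_vxy[0] * tau_s,
              train_xy[1] + train_vxy[1] * tau_s)
    start_ok = math.dist(node_xy, train_xy) <= d_max
    end_ok = math.dist(node_xy, end_xy) <= d_max
    return start_ok and end_ok

d_max = max_t2w_distance(0.5, snr_threshold_db=10.0)  # helper from the previous sketch
print(is_candidate(node_xy=(300.0, 15.0), train_xy=(100.0, 0.0),
                   train_vxy=(22.0, 0.0), tau_s=0.05, d_max=d_max))
```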
3.2.2. Delay Model
During packet transmission, a time delay is generated. In this section, we build a delay model to calculate the transmission delay between nodes. We divide the total delay of packet transmission between wayside nodes into two main components: the transmission delay caused by the node sending the packet, and the queuing delay due to node congestion. In this subsection, the delay calculation is the same for both W2W and T2W transmission; thus, both are expressed as a transmission between node $i$ and node $j$.
When a node sends data packets, the transmission delay can be represented as

$$\tau^{\mathrm{tr}}_{i,j}(t) = \frac{L}{C_{i,j}(t)},$$

where $L$ is the number of bits in the packet, and $C_{i,j}(t)$ is the transmission rate of the channel between node $i$ and node $j$.
Queuing delay [27] is unavoidable when large amounts of data are transmitted. Therefore, it is crucial to build a node queuing model. The queue follows the first-in first-out (FIFO) rule. When the CBTC system is stable, we assume that each wayside node can receive multiple data streams simultaneously, so that scheduling effects can be neglected, and that the average arrival rate and queuing situation are essentially fixed. In order to calculate the queuing delay, we use Little's formula.

In Little's law, the average waiting time of a queue can be calculated as the queue length divided by the effective throughput. Since the buffer length of our designed model is limited, we calculate the effective throughput by considering the packet loss rate and the packet error rate. According to Little's law [31,32], the packet delay at the next-hop node $j$ can be expressed as

$$\tau^{\mathrm{que}}_{j}(t) = \frac{Q_j(t)}{\lambda^{\mathrm{eff}}_{j}(t)},$$

where $Q_j(t)$ is the average number of packets queued at node $j$, and $\lambda^{\mathrm{eff}}_{j}(t)$ is the effective throughput of node $j$.
The effective throughput of node $j$ is given by

$$\lambda^{\mathrm{eff}}_{j}(t) = \lambda_j \big(1 - P^{\mathrm{fail}}_{i,j}\big),$$

where $\lambda_j = A_j(t)/t$ is the average arrival rate of node $j$, $A_j(t)$ is the total number of packets arriving in a time slot $t$, and $P^{\mathrm{fail}}_{i,j}$ is the unsuccessful transmission rate of link $l_{i,j}$. Many factors affect the unsuccessful transmission rate, such as the packet error rate $P^{\mathrm{err}}$ and the packet loss rate $P^{\mathrm{loss}}$. If a packet is lost or a transmission error occurs, the packet becomes unusable; hence, the probability of unsuccessful transmission can be expressed as

$$P^{\mathrm{fail}}_{i,j} = 1 - \big(1 - P^{\mathrm{err}}\big)\big(1 - P^{\mathrm{loss}}\big).$$
For the calculation of the packet error rate, we assume that the channel is modulated using quadrature phase shift keying (QPSK). Therefore, the bit error rate (BER) [2] is

$$P_b = Q\!\left(\sqrt{\Gamma_{i,j}(t)}\right),$$

where $Q(\cdot)$ is the Gaussian Q-function and $\Gamma_{i,j}(t)$ is the SNR between node $i$ and node $j$; when the SNR between the two nodes is large enough, $P_b$ approaches zero. Meanwhile, as the length of the packet is $L$ bits, the error rate of the whole packet can be represented as

$$P^{\mathrm{err}} = 1 - \big(1 - P_b\big)^{L}.$$
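As a quick numerical illustration of the QPSK bit and packet error rates, the following sketch uses the Gaussian Q-function written via the complementary error function; the SNR value and the 2048-bit packet length are assumptions.

```python
import math

def q_function(x):
    """Gaussian Q-function expressed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def packet_error_rate(snr_linear, packet_bits):
    """QPSK bit error rate and the resulting whole-packet error rate."""
    ber = q_function(math.sqrt(snr_linear))
    return ber, 1.0 - (1.0 - ber) ** packet_bits

ber, per = packet_error_rate(snr_linear=10 ** (10.0 / 10), packet_bits=2048)
print(f"BER = {ber:.2e}, packet error rate = {per:.2e}")
```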
For the packet loss rate $P^{\mathrm{loss}}$, we build a node queuing model [33,34]. Packet loss is caused by the limited buffer length of the node: if the total packet length exceeds the buffer length, the excess packets are not received by the node, which increases the packet loss rate. We define $Q_{\max}$ as the maximum number of packets that the buffer can hold, $Q_j(t-1)$ as the number of packets left over from the previous time slot, and $A_j(t)$ as the average number of packets arriving at the node in time slot $t$. The average number of arriving packets per time slot can be derived as $A_j(t) = \lambda^{b}_{j} t / L$, where $\lambda^{b}_{j}$ is the arrival rate (in bits per second) of node $j$, and $L$ is the length of the packet. The level of packet loss is the buffer overflow $O_j(t)$, which can be expressed as

$$O_j(t) = \max\!\big(Q_j(t-1) + A_j(t) - Q_{\max},\; 0\big).$$

When the train or wayside relay steadily sends packets to the next hop, $\lambda^{b}_{j}$ and $L$ remain constant during transmission. Therefore, $A_j(t)$ and $Q_j(t-1)$ can also be regarded as constant. In time slot $t$, the packet loss rate of this node can be calculated by

$$P^{\mathrm{loss}} = \frac{\mathbb{E}\big[O_j(t)\big]}{\mathbb{E}\big[A_j(t)\big]},$$

where $\mathbb{E}[\cdot]$ is the mathematical expectation. If there is no packet overflow, the packet loss rate is zero; otherwise, the packet loss rate is the number of lost packets divided by the number of packets arriving at the node. The packet loss rate and the packet error rate are calculated in Equations (18) and (20), respectively, and are then introduced into Equations (14) and (15) to calculate the queuing delay. Therefore, the total delay of one hop from node $i$ to node $j$ can be expressed as

$$\tau_{i,j}(t) = \tau^{\mathrm{tr}}_{i,j}(t) + \tau^{\mathrm{que}}_{j}(t).$$
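Putting the pieces together, the sketch below combines the transmission delay, the buffer-overflow packet loss, the packet error rate, and the Little's-law queuing delay into the two per-hop delay components; packet_error_rate is the helper from the previous snippet, and all numeric parameters (buffer size, arrival rate, queue length) are assumptions.

```python
def one_hop_delay(packet_bits, link_rate_bps, snr_linear,
                  queue_pkts, arrival_pkts_per_slot, buffer_pkts, leftover_pkts):
    """Transmission delay plus Little's-law queuing delay for one hop."""
    # Packet error rate from the QPSK model above (helper from the previous sketch).
    _, p_err = packet_error_rate(snr_linear, packet_bits)
    # Buffer-overflow packet loss rate.
    overflow = max(leftover_pkts + arrival_pkts_per_slot - buffer_pkts, 0)
    p_loss = overflow / arrival_pkts_per_slot
    # Unsuccessful transmission rate and effective throughput (packets per slot).
    p_fail = 1.0 - (1.0 - p_err) * (1.0 - p_loss)
    effective_rate = arrival_pkts_per_slot * (1.0 - p_fail)
    tau_tr = packet_bits / link_rate_bps   # transmission delay, seconds per packet
    tau_que = queue_pkts / effective_rate  # queuing delay in time slots (Little's law)
    return tau_tr, tau_que

print(one_hop_delay(packet_bits=2048, link_rate_bps=20e6, snr_linear=10.0,
                    queue_pkts=8, arrival_pkts_per_slot=12, buffer_pkts=20,
                    leftover_pkts=10))
```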
3.2.3. Hop Tradeoff
In the next-hop selection, if we only pursue large throughput and small delay for a single hop, the hop count of the entire T2T link will increase. Therefore, we design a "hop tradeoff" indicator to optimize the number of hops over the entire link. The source train $T_s$ and the destination train $T_d$ need to exchange information, and the distance of the whole T2T link is $D_{s,d}$. During the transmission process, the one-hop distance between node $i$ and node $j$ is $d_{i,j}$. We calculate the number of hops $H_{i,j}$ that would be required to complete the entire T2T link if every hop covered the one-hop distance $d_{i,j}$, which can be represented as

$$H_{i,j} = \left\lceil \frac{D_{s,d}}{d_{i,j}} \right\rceil.$$
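The hop-tradeoff indicator is simply a projected hop count, as in the following one-line sketch; rounding the ratio up with a ceiling is an assumption about how fractional hops are handled.

```python
import math

def hop_tradeoff(total_t2t_distance_m, one_hop_distance_m):
    """Projected number of hops if every hop covered the candidate's one-hop distance."""
    return math.ceil(total_t2t_distance_m / one_hop_distance_m)

print(hop_tradeoff(total_t2t_distance_m=2400.0, one_hop_distance_m=180.0))  # -> 14
```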
4. Problem Formulation
In the CBTC scenario, different numbers of wayside nodes assist information transmission depending on the distance between two trains; thus, the selection of the transmission link is particularly important. To solve the problem of multi-hop relay selection in wireless ad hoc networks, we propose an optimal transmission model based on a discrete Markov process. The aim is to design a relay node selection policy that satisfies low latency and high throughput, so that information can be transmitted quickly and accurately between trains.
Since the next-hop selection depends only on the current state, and the next state changes with the current action selection, next-hop selection can be considered an MDP. The transition probabilities between states in the MDP are unknown; thus, we can use the DRL approach to better solve the proposed problem. In DRL, the agent finds the optimal policy that maximizes the long-term reward according to the channel state information (CSI) and the congestion level of the nodes. In this paper, we use multi-agent DRL (MADRL), in which each node acts as an agent. As agents need to make decisions within a short period of time, the agents' network is trained offline and centrally by collecting information from the nodes, and is then ported to each agent for online decision making. Therefore, each agent selects the next-hop node independently according to the current state, without additional communication or further training. In DRL, there are several key components, as described below.
- (1) State Space
In each time slot, the agent updates and learns the policy by observing the variation of the state. In particular, the state contains two components: the number of packets queued at each node and the channel throughput of the links between node pairs. In time slot $t$, the state space is defined as

$$s(t) = \big\{ q_1(t), \ldots, q_N(t),\; C_{k,1}(t), \ldots, C_{k,N}(t) \big\},$$

where $q_n(t)$ indicates the queue length of each node, and $C_{k,n}(t)$ is the throughput between the transmitting node $k$ and the other nodes, with $k \in \{0, 1, \ldots, N\}$; when $k \ge 1$, wayside node $R_k$ sends packets at time slot $t$, whereas $k = 0$ means that the train sends the packet.
- (2) Action Space
According to the channel state and queue state of each candidate node, the optimal next-hop node is selected. The action space can be given by

$$a(t) = \big\{ a_1(t), a_2(t), \ldots, a_N(t) \big\},$$

where $a_n(t) \in \{0, 1\}$. If $a_n(t) = 1$, then the next hop is node $R_n$.
- (3) Reward Function
In the selection of the next-hop node, the optimization objective is to minimize the delay and maximize the throughput of the entire T2T link while ensuring that the next-hop SNR is greater than the threshold value. A packet is successfully transmitted from the initial train to the destination train after $H$ hops through wayside nodes $R_{k_1}, R_{k_2}, \ldots, R_{k_{H-1}}$, where $T_s$ is the packet source train and $T_d$ is the packet destination train.

The total transmission time for packet transmission between source train $T_s$ and destination train $T_d$ is

$$\tau_{s,d} = \sum_{h=1}^{H} \tau_{k_{h-1}, k_h}, \qquad k_0 = T_s,\; k_H = T_d.$$
In order to better measure the quality of each hop in the transmission link while taking the packet loss and packet error rates into account, the throughput of the entire link is defined as follows [35,36]:

$$C_{s,d} = \big(1 - P^{\mathrm{fail}}_{s,d}\big)\, \min_{h \in \{1, \ldots, H\}} C_{k_{h-1}, k_h},$$

where $P^{\mathrm{fail}}_{s,d}$ is the unsuccessful transmission rate of the whole link, with each one-hop unsuccessful rate calculated using Equation (16), and $C_{k_{h-1}, k_h}$ denotes the throughput between consecutive wayside nodes on the link.
The optimization goal of the proposed MADDQN is to reduce the latency and improve the throughput of the whole link; hence, the proposed optimization problem is defined as

$$\min_{\{a(t)\}} \;\; \omega_1\, \tau_{s,d} - \omega_2\, C_{s,d} \qquad \text{s.t. C1–C3},$$

where $\tau_{s,d}$ and $C_{s,d}$ are the whole-link latency and throughput between train $T_s$ and train $T_d$, respectively, and $\omega_1$ and $\omega_2$ are the weight factors of the delay and throughput ($\omega_1 + \omega_2 = 1$).
C1 ensures that the SNR of the channel is greater than the threshold value. C2 indicates that each hop delay needs to be less than the target transmission time; if one hop takes too long, the transmission is considered to have failed. C3 indicates that, when the train transmits with a wayside node, their distance should be less than the maximum transmission distance $d_{\max}$.
The defined reward function comprehensively considers the throughput and delay of each hop and additionally incorporates the hop-tradeoff indicator $H_{i,j}$ introduced in Section 3.2.3. Therefore, the reward function is defined as

$$r(t) = \begin{cases} \dfrac{\omega_1}{\tau_{i,j}(t)\, H_{i,j}} + \omega_2\, C_{i,j}(t) + r_{\mathrm{dest}}\,\mathbb{1}_{\{j = T_d\}}, & \text{if the hop satisfies C1–C3}, \\[4pt] -r_{\mathrm{p}}, & \text{otherwise}, \end{cases}$$

where $r_{\mathrm{dest}}$ is an additional reward granted only when the final hop delivers the packet directly to the destination train ($\mathbb{1}_{\{j = T_d\}} = 1$). This reward is established to prevent other wayside nodes close to the train from being selected, which may lead to an increase in hops and delay. The penalty $r_{\mathrm{p}}$ is the outage penalty imposed when the next-hop node is out of the communication range or the single-hop delay is too long, i.e., when constraints C1–C3 are violated.
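Purely for illustration, the following sketch shows how the state vector, the per-hop reward, the destination bonus, and the constraint penalty described above could be assembled; the weights, bonus and penalty magnitudes, throughput scaling, and exact functional form are assumptions, since the paper's own equations define the precise combination.

```python
import numpy as np

def build_state(queue_lengths, link_throughputs):
    """State = queue length of every node plus throughput from the sender to each node."""
    return np.concatenate([np.asarray(queue_lengths, float),
                           np.asarray(link_throughputs, float)])

def hop_reward(hop_delay_s, hop_throughput_bps, hop_tradeoff, constraints_ok,
               is_final_hop, w_delay=0.5, w_tp=0.5, dest_bonus=1.0, penalty=5.0):
    """Per-hop reward: favour low delay (scaled by the hop-tradeoff indicator) and
    high throughput; add a bonus for reaching the destination train, and penalise
    any hop that violates constraints C1-C3."""
    if not constraints_ok:
        return -penalty
    reward = w_delay / (hop_delay_s * hop_tradeoff) + w_tp * hop_throughput_bps / 1e6
    return reward + (dest_bonus if is_final_hop else 0.0)

s = build_state(queue_lengths=[3, 7, 1, 0], link_throughputs=[18e6, 9e6, 0.0, 22e6])
print(s.shape, hop_reward(0.004, 18e6, hop_tradeoff=12,
                          constraints_ok=True, is_final_hop=False))
```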
5. Problem Solution
Since value-based methods are suitable for solving discrete-space problems, and next-hop relay selection is a discrete action, we choose a value-based reinforcement learning approach for policy optimization. Our proposed scheme has a large number of channel states and queuing states, which leads to a high-dimensional Q-table and makes the Q-learning [37] algorithm difficult to converge during training. The DQN algorithm solves this problem by combining a deep neural network (DNN) with the Q-value. DQN does not directly select the action with the highest Q-value from a Q-table, but instead fits the Q-value $Q(s, a; \theta)$ with a neural network [38]. Compared to recording the Q-value for every action-state pair, DQN only needs to store the weights of each neuron to calculate the Q-values of all policies, which greatly reduces the storage space and makes the algorithm converge faster [39,40].
Since each node needs to make its own next-hop decision, we use the multi-agent dueling DQN (MADDQN) approach. MADDQN treats each node as an independent agent, and all the agents are trained centrally. When making the next-hop selection, the trained network parameters are shared with all nodes, and each node selects its next-hop node individually. The specific process of MADDQN is shown in Figure 2.
5.1. DQN
In each time slot $t$, each node acts as an agent that observes the current state $s(t)$ of the system, including the congestion level and the channel state information of all nodes. Then, the agent chooses a suitable action $a(t)$ to select the next-hop node. After the action $a(t)$ is taken, a new state $s(t+1)$ is obtained, and the reward $r(t)$ corresponding to this action is also computed. The goal of the agent is to find a policy $\pi$ that maximizes the expected discounted cumulative reward [9]. Therefore, the state-action value function is used to calculate the expected discounted cumulative reward of each relay selection policy, and the policy $\pi^{*}$ with the largest reward is then selected. The state-action value function is defined as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r(t+k) \,\middle|\, s(t) = s,\, a(t) = a\right],$$

where $\gamma \in [0, 1]$ is the discount factor, which represents the ratio between the immediate and long-term rewards, and $r(t)$ is the immediate reward at time slot $t$.
When selecting the action in the next time slot, the agent does not simply pick the next-hop relay with the maximum Q-value; an $\epsilon$-greedy mechanism is also added to explore additional actions. In order to try more possible actions and avoid falling into a local maximum, the agent chooses an action randomly with probability $\epsilon$. The $\epsilon$-greedy policy is denoted as

$$a(t) = \begin{cases} \text{a random action}, & \text{with probability } \epsilon, \\ \arg\max_{a} Q\big(s(t), a; \theta\big), & \text{with probability } 1 - \epsilon. \end{cases}$$
Due to the varying channel states and node congestion levels, a very large number of states is formed; thus, it is impossible to calculate the Q-value for every action-state pair. In DQN, a convolutional neural network (CNN) is trained to obtain the weight $\theta$ of each neuron. After the current channel state and the next-hop relay selection action are input, the neural network fits the state-action value $Q(s, a; \theta)$. To make the network converge faster, DQN maintains two networks: a target network and an evaluation network. During the training process, the weights $\theta$ of the evaluation network are continuously updated and are copied to the target network weights $\theta^{-}$ at certain intervals. The weight $\theta$ is then updated by stochastic gradient descent to minimize the loss between the target network and the evaluation network. The loss function is defined as

$$L(\theta) = \mathbb{E}\Big[\big(y(t) - Q(s(t), a(t); \theta)\big)^{2}\Big],$$

where the target value for each iteration of the network is represented as

$$y(t) = r(t) + \gamma \max_{a'} Q\big(s(t+1), a'; \theta^{-}\big).$$
To focus more on historical experiences and to break the correlation between consecutive experiences, DQN also uses an experience replay mechanism. At each time slot, when the nodes are trained centrally, each node acts as an agent and stores its training experience as the transition $(s(t), a(t), r(t), s(t+1))$ in the experience pool. For each training step, a small number of samples are randomly selected from the experience pool as a batch for network training, which makes the network converge better.
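As a concrete reference point, the following PyTorch-style sketch performs one DQN update with a target network and experience replay, matching the loss and target expressions above; the replay size, discount factor, optimizer, and batch size are assumed hyperparameters rather than the paper's settings.

```python
import random
from collections import deque

import torch
import torch.nn as nn

replay = deque(maxlen=10_000)  # experience pool of (s, a, r, s_next) tensor tuples
gamma = 0.9                    # assumed discount factor

def dqn_update(eval_net, target_net, optimizer, batch_size=32):
    """One stochastic-gradient step on the DQN loss L(theta) = E[(y - Q(s,a;theta))^2]."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s_next = map(torch.stack, zip(*batch))
    q_sa = eval_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    with torch.no_grad():                                           # target uses frozen theta^-
        y = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```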
5.2. Dueling DQN
The dueling DQN network [19] makes further improvements on the DQN network structure. In DQN, the network directly outputs the state-action value $Q(s, a; \theta)$ corresponding to each relay selection policy. In dueling DQN, however, the output Q-value is split into two branches: the state value $V(s; \theta, \beta)$, indicating the value of the current channel and queuing state, and the action advantage value $A(s, a; \theta, \alpha)$, representing the additional value brought by the relay selection action. Finally, the output values of the two branches are combined to make the estimation of $Q(s, a)$ more accurate. The combination of the two branches can be written as

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha),$$

where $\theta$, $\alpha$, and $\beta$ are the parameters of the shared layers, the advantage branch, and the value branch of the neural network, respectively. In order to prevent multiple combinations of the state value $V$ and the action advantage value $A$ from yielding the same state-action value $Q$, and to make the algorithm more stable [38], Equation (30) is replaced by

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\right).$$
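A minimal PyTorch sketch of this dueling architecture with the mean-subtracted advantage aggregation is given below; the layer widths, the fully connected trunk, and the example state layout are assumptions.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')), with a shared feature trunk."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)                 # V(s; theta, beta)
        self.advantage_head = nn.Linear(hidden, num_actions)   # A(s, a; theta, alpha)

    def forward(self, state):
        features = self.trunk(state)
        value = self.value_head(features)
        advantage = self.advantage_head(features)
        return value + advantage - advantage.mean(dim=-1, keepdim=True)

# Example: 8 wayside nodes -> state holds 8 queue lengths + 8 link throughputs (assumed layout).
net = DuelingDQN(state_dim=16, num_actions=8)
print(net(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```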
The proposed multiagent dueling DQN is shown in Algorithm 1.
Algorithm 1 Dueling-DQN
1: Initialization: Initialize the maximum buffer capacity $Q_{\max}$ and packet length $L$; initialize the number of nodes along the rail $N$; initialize the replay memory size, batch size, greedy coefficient $\epsilon$, and learning rate.
2: for episode in range $K$ do
3:  Reset the channel quality and the queue length of each node as the initial state $s(0)$.
4:  while the packet has not reached the destination train do
5:   Choose action: with probability $\epsilon$, choose the next-hop node at random;
6:   otherwise, choose the action with the maximum $Q(s(t), a; \theta)$.
7:   From the current state $s(t)$ and the action $a(t)$ of this hop, obtain the reward $r(t)$ for this action and the next state $s(t+1)$.
8:   Store $(s(t), a(t), r(t), s(t+1))$ into the experience replay memory.
9:   Randomly take a minibatch of transitions from the experience replay memory.
10:  Combine the two branches $V(s)$ and $A(s, a)$ into $Q(s, a; \theta, \alpha, \beta)$.
11:  Calculate the target Q-value $y(t) = r(t) + \gamma \max_{a'} Q(s(t+1), a'; \theta^{-})$.
12:  Minimize the loss function using Equation (30).
13:  Update the target network after several steps using the parameters of the evaluation network.
14: end while
15: end for