Reinforcement Learning-Based Multihop Relaying: A Decentralized Q-Learning Approach

Conventional optimization-based relay selection for multihop networks cannot resolve the conflict between performance and cost. The optimal selection policy is centralized and requires the local channel state information (CSI) of all hops, leading to high computational complexity and signaling overhead. Other optimization-based decentralized policies incur non-negligible performance loss. In this paper, we exploit the benefits of reinforcement learning in relay selection for multihop clustered networks, aiming to achieve high performance at limited cost. The multihop relay selection problem is modeled as a Markov decision process (MDP) and solved by a decentralized Q-learning scheme with a rectified update function. Simulation results show that this scheme achieves a near-optimal average end-to-end (E2E) rate. Cost analysis reveals that it also reduces computational complexity and signaling overhead compared with the optimal scheme.


Introduction
Multihop relaying extends transmission range and forms the essential communication structure of many practical networks, such as ad hoc networks and vehicular networks. In these networks, the candidate relays for each hop are often clustered. For example, in vehicular networks, a vehicle accesses a roadside unit (RSU) with the help of multiple relay vehicles that are often geographically clustered. A judiciously designed relay selection policy therefore guarantees a stable and efficient communication path. The optimal selection policy searches for the best path by maximization over the inter-cluster channel state information (CSI) of all hops; its computational complexity and signaling overhead are considerably high. Decentralized selection schemes have thus been proposed to reduce these costs at the expense of some performance [1][2][3][4]. Ref. [2] considered clustered multihop networks and proposed a decentralized relay selection scheme that selects a set of relays. This scheme exploits multiuser diversity but causes interference, so the size of the selected set must be kept very small. In [3], a decentralized selection scheme is proposed to choose the best relay of each cluster with physical-layer security taken into account. Another way to design decentralized relay selection is to equip each node in a cluster with a timer whose duration is inversely proportional to its channel quality; the node whose timer expires first is selected as the relay. Despite these efforts, a satisfactory tradeoff between performance and cost has not been achieved. It is therefore worthwhile to investigate new decentralized selection schemes that further narrow the performance gap to the optimal policy.
Recently, machine learning has found extensive application in optimization for wireless communications, such as antenna selection [5,6], relay selection [7][8][9], and power allocation [10,11]. Learning tools used to solve these problems include supervised learning [5][6][7][8], reinforcement learning [12], neural networks (NN) [9], etc. The flourishing of learning-based optimization inspires us to explore new multihop relay selection schemes. More complex optimization problems have recently been solved by reinforcement learning in dualhop relay networks, combining relay selection with other techniques such as energy harvesting [13], buffer-aided relays [14], device-to-device (D2D) communications [15], and access control [16]. In [7,8], relay selection for dualhop networks is modeled as multiclass classification and solved by decision trees. However, multihop clustered relaying yields a large number of possible paths, which makes classification inefficient. In [17,18], relay selection schemes based on reinforcement learning are proposed for dualhop networks, but these schemes cannot be extended to multihop networks. To solve the multihop relay selection problem, we design a novel learning-based scheme.
In this paper, multihop relay selection is modeled as a Markov decision process (MDP) and solved by reinforcement learning. We propose a Q-learning-based decentralized algorithm that allows each cluster to train its own Q-table and predict its relay selection. We aim to reduce computational complexity and signaling overhead while keeping a near-optimal average end-to-end (E2E) rate.

Communication Model
We consider a linear multihop network with a source node (S), a destination node (D), and M clusters of relays denoted by C_1, ..., C_M. Cluster C_m consists of K_m decode-and-forward (DF) relay nodes denoted by R^m_1, R^m_2, ..., R^m_{K_m}, m = 1, ..., M. For convenience, we let C_0 denote S and C_{M+1} denote D, so K_0 = K_{M+1} = 1. Neither the relays nor S has a direct link to D, except the nodes in C_M. The signal transmitted by S is delivered to D along a path composed of M relays, one selected from each of the M clusters. At the beginning of the multihop transmission, S broadcasts its data to C_1. One member of C_1 is selected to receive the data and broadcast it to C_2. In this manner, the data is relayed until it is received by D.
The wireless channels in the network are assumed to experience independent and identically distributed (i.i.d.) Rayleigh fading. The noise at each receiver is modeled as a complex Gaussian random variable with zero mean and variance σ². Let h^m_{k'k} denote the inter-cluster complex channel coefficient from R^{m−1}_{k'} to R^m_k. Assuming that the transmit power of R^{m−1}_{k'} is P_{m−1}, the received signal-to-noise ratio (SNR) of R^m_k is given by

Γ^m_{k'k} = P_{m−1} |h^m_{k'k}|² / σ².   (1)

Since DF relaying is used, the E2E SNR is limited by the worst hop, and the E2E SNR and E2E rate of the multihop communication are expressed as

Γ_E = min_{1 ≤ m ≤ M+1} Γ^m_{(m−1)* m*}   (2)

and

R_e2e = log₂(1 + Γ_E),   (3)

where (m−1)* and m* index the nodes selected from the two adjacent clusters of hop m.
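The per-hop SNR and the bottleneck E2E rate above can be sketched as follows (an illustrative Python fragment; the function names are ours, not part of the system model):

```python
import math

def hop_snr(p_prev, h, sigma2):
    # Received SNR of one hop: transmit power times |h|^2 over noise power
    return p_prev * abs(h) ** 2 / sigma2

def e2e_rate(hop_snrs):
    # DF relaying: the E2E SNR is the minimum hop SNR; rate = log2(1 + SNR)
    return math.log2(1.0 + min(hop_snrs))
```

For instance, hops with SNRs 3, 1, and 7 yield an E2E rate of log₂(1 + 1) = 1 bit/s/Hz, determined entirely by the bottleneck hop.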

Optimal Selection
In the considered network, there are ∏_m K_m possible paths. The optimal selection is a centralized maximization-based scheme that chooses the best path. The central controller collects h^m_{k'k} for all k, k', m, and computes Γ^m_{k'k} and Γ_E for each path. The optimal policy selects the path (relay combination) yielding the maximum R_e2e, as described by

(1*, 2*, ..., M*) = arg max over all relay combinations of R_e2e.   (4)

Here, m* denotes the index of the selected relay of C_m, and 0* represents S.
Selecting the maximum among ∏_m K_m values has O(∏_m K_m) complexity and requires all inter-cluster CSI. Designing decentralized multihop relay selection schemes reduces both costs.
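The optimal centralized policy amounts to a brute-force search over all ∏_m K_m relay combinations, which can be sketched as follows (illustrative Python; `snr[m][i][j]` holds the SNR of hop m from node i to node j, with clusters 0 and M+1 being S and D of size 1):

```python
import itertools
import math

def optimal_path(snr, cluster_sizes):
    # Exhaustively evaluate every relay combination and keep the one whose
    # bottleneck (minimum) hop SNR, and hence E2E rate, is largest.
    best_path, best_g = None, -1.0
    for path in itertools.product(*[range(k) for k in cluster_sizes]):
        nodes = (0,) + path + (0,)  # S, selected relays, D
        g = min(snr[m][nodes[m]][nodes[m + 1]] for m in range(len(snr)))
        if g > best_g:
            best_g, best_path = g, path
    return best_path, math.log2(1.0 + best_g)
```

The loop visits all ∏_m K_m paths, matching the O(∏_m K_m) complexity noted above.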

Conventional Decentralized Selection
To distribute the computation, the M relays are selected separately and successively, one at each cluster. For 1 ≤ m ≤ M − 1, the conventional decentralized selection policy is described by

m* = arg max_{k} Γ^m_{(m−1)* k},   (5)

and for m = M, where the link to D must also be sustained,

M* = arg max_{k} min(Γ^M_{(M−1)* k}, Γ^{M+1}_{k 1}).   (6)
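The hop-by-hop policy can be sketched as follows (illustrative Python, same `snr` layout as the network model; the `min()` in the last step reflects our reading of the rule for C_M, whose choice must also sustain the link to D):

```python
def decentralized_select(snr, cluster_sizes):
    # Greedy per-cluster selection: each cluster picks the relay with the
    # best channel from the previously selected node; the last cluster also
    # accounts for its link to D via the bottleneck of the two final hops.
    prev, path = 0, []
    M = len(cluster_sizes)
    for m in range(M):
        if m < M - 1:
            k = max(range(cluster_sizes[m]), key=lambda j: snr[m][prev][j])
        else:
            k = max(range(cluster_sizes[m]),
                    key=lambda j: min(snr[m][prev][j], snr[m + 1][j][0]))
        path.append(k)
        prev = k
    return tuple(path)
```

Each cluster needs only the CSI of its incoming links (plus, for C_M, its links to D), which is the source of the cost savings discussed later.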

Q-Learning-Based Multihop Relaying
In the considered network, a cluster only requires information from the previous and the next clusters; hence, the multihop transmission is naturally Markov. In this section, we model multihop relay selection as an MDP and propose a Q-learning-based decentralized relay selection scheme. The scheme is composed of three phases: initialization, training, and prediction. The tasks of training and prediction are decentralized to each cluster; hence, each cluster, including D, maintains a Q-table and a reward table. Training and prediction are completed in a successive manner from C_1 to C_M. The Q-tables are updated over multiple episodes until convergence is reached, and are then used to search for the best relays. First, we provide the basic definitions of a standard Q-learning algorithm [19], taking the algorithm on C_m as an example.
State: s_m represents the selected relay node of C_{m−1}, which broadcasts the data-carrying signal to C_m.
Action: a_m represents the relay node selected from C_m, which will receive the signal broadcast by s_m.
Reward: r_m(s_m, a_m) denotes the immediate reward of taking action a_m in state s_m, defined as the achievable rate of the corresponding hop, as given in (7).
Q-value: Q_m(s_m, a_m) denotes the Q-value of a given state-action pair (s_m, a_m), which evaluates the accumulated value of taking a_m in state s_m. Q_m(s_m, a_m) is stored in the Q-table of C_m and is obtained by iterative updating.
System parameters include the learning rate α, the discount factor γ, and the convergence error threshold ε. A brief illustration of this scheme is given in Figure 1.
Figure 1. Q-learning-based multihop relaying.

Initialization
The reward table and Q-table of C_m are initialized as K_{m−1} × K_m tables, m = 1, ..., M + 1. The reward table stores the rewards of all possible state-action pairs. To initialize it, C_m estimates the CSI of the channels from each node of C_{m−1} to each node of its own, and calculates the reward values using (7). The Q-table stores the Q-values of all (s_m, a_m) pairs; each row corresponds to a node in C_{m−1} and each column to a node in C_m. All Q-values of the tables from C_1 to C_{M+1} are initialized to 0.
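As a concrete illustration, the two tables kept by one cluster could be built as follows (a sketch; we assume the reward in (7) is the achievable rate of the hop, in line with the description of the training phase, and all names are ours):

```python
import math

def init_tables(snr_m, k_prev, k_m):
    # Reward table: achievable rate of every (state, action) pair, computed
    # from locally estimated CSI (here given directly as per-link SNRs).
    reward = [[math.log2(1.0 + snr_m[s][a]) for a in range(k_m)]
              for s in range(k_prev)]
    # Q-table: same K_{m-1} x K_m shape, initialized to zero.
    q = [[0.0] * k_m for _ in range(k_prev)]
    return reward, q
```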

Training
The training phase iteratively updates the Q-tables of each cluster in a successive manner over multiple episodes until convergence is reached. The key issue of training is the update function. The standard update function of [19] involves only the reward of the current state, i.e., the rate of the current hop. However, in DF multihop relaying, the rate of an individual hop does not contribute to the E2E rate if the rate of another hop is even smaller. An update function that accounted for all hops at once would capture this, but would not be economical. We predict that the standard update function will not yield high performance. Instead, we rectify the standard update function so that it uses the bottleneck of the rewards of the current hop and the next hop; maximizing the Q-values then eventually maximizes the E2E rate. The new update function is given by

Q_m(s_m, a_m) ← (1 − α) Q_m(s_m, a_m) + α [min(r_m(s_m, a_m), r_{m+1}(a_m, a_{m+1})) + γ Q_max(m + 1)].   (8)
At the beginning of the nth updating episode, s_1 = S and a_1 is randomly selected from C_1. C_{m+1} (here C_2) chooses its best action by

a_{m+1} = arg max_{a'} Q_{m+1}(a_m, a'),   (9)

and computes the maximum Q-value and the reward of a_{m+1} by

Q_max(m + 1) = max_{a'} Q_{m+1}(a_m, a')   (10)

and

r_{m+1} = r_{m+1}(a_m, a_{m+1}).   (11)

Then, Q_max(2) and r_2 are sent back to C_1 and used to update Q_1(S, a_1) by (8). Next, for C_2, s_2 = a_1 and a_2 is randomly selected from C_2; the remaining procedure is the same as for C_1, and Q_2(s_2, a_2) is updated. This per-cluster updating procedure is repeated at each of the following clusters to finish the episode. If the change of the E2E rate is lower than ε, the Q-tables have converged and training is complete; otherwise, a new episode of updating begins.

Prediction
Successively from C_1 to C_M, each cluster searches its Q-table and selects the best relay. First, C_1 searches its single-row Q-table for the action with the maximum Q-value and selects it as the best relay. This selection policy is described by

m* = arg max_{a_m} Q_m((m − 1)*, a_m).   (12)

Notified of 1*, C_2 searches the row of its Q-table corresponding to 1* and obtains 2*. After all clusters complete their relay selection, a multihop path is established through which the source data is delivered to D. The proposed Q-learning-based relay selection scheme for multihop clustered networks is summarized in Algorithm 1.

Algorithm 1. Q-learning-based relay selection for multihop clustered networks.
Phase 1. Initialization
  Initialize the reward table and Q-table of each cluster
Phase 2. Training
  for n = 1, 2, ...
    s_1 = S
    for m = 1 : M
      C_m randomly selects an action a_m from C_m
      Request Q_max(m + 1) and r_{m+1} from C_{m+1}
      Update Q_m(s_m, a_m) using (8)
      s_{m+1} = a_m
    end
    Compute the end-to-end rate R_n
    if |R_n − R_{n−1}| < ε, break; else n = n + 1
  end
Phase 3. Prediction
  Let 0* = S
  for m = 1 : M
    C_m searches its Q-table and selects the best relay m* using (12)
    Notify m* to C_{m+1}
    R^m_{m*} receives the signal transmitted from R^{m−1}_{(m−1)*} and broadcasts it to C_{m+1}
  end
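The whole scheme can be prototyped in a few dozen lines. The sketch below (illustrative Python, names ours) indexes the Q-tables per hop, uses the achievable rate as the reward, and implements the rectified update as Q ← (1 − α)Q + α[min(r_m, r_{m+1}) + γ Q_max(m+1)], which is our reconstruction of (8); for simplicity, the convergence test on |R_n − R_{n−1}| is disabled when ε = 0.

```python
import math
import random

def rate(g):
    # Per-hop achievable rate from SNR (the reward in (7), as we read it)
    return math.log2(1.0 + g)

def greedy_path(Q):
    # Prediction phase: successively pick the max-Q action per hop
    s, path = 0, []
    for table in Q:
        a = max(range(len(table[s])), key=lambda j: table[s][j])
        path.append(a)
        s = a
    return path

def e2e(snr, path):
    # E2E rate of a chosen path: rate of the bottleneck hop
    s, g = 0, float("inf")
    for m, a in enumerate(path):
        g = min(g, snr[m][s][a])
        s = a
    return rate(g)

def train(snr, alpha=0.5, gamma=0.9, eps=0.0, max_episodes=3000, seed=1):
    # Q[m][s][a]: Q-value for choosing node a of cluster m+1 when node s of
    # cluster m holds the data (each receiving cluster keeps its own table).
    rng = random.Random(seed)
    H = len(snr)  # number of hops = M + 1
    Q = [[[0.0] * len(row) for row in hop] for hop in snr]
    prev_rate = None
    for _ in range(max_episodes):
        s = 0
        for m in range(H):
            a = rng.randrange(len(snr[m][s]))  # explore: random action
            r = rate(snr[m][s][a])
            if m + 1 < H:
                # Next cluster reports back Q_max(m+1) and r_{m+1}
                nxt = max(range(len(Q[m + 1][a])), key=lambda j: Q[m + 1][a][j])
                target = min(r, rate(snr[m + 1][a][nxt])) + gamma * Q[m + 1][a][nxt]
            else:
                target = r  # last hop: no following cluster
            Q[m][s][a] = (1 - alpha) * Q[m][s][a] + alpha * target  # eq. (8)
            s = a
        cur = e2e(snr, greedy_path(Q))
        if eps > 0 and prev_rate is not None and abs(cur - prev_rate) < eps:
            break
        prev_rate = cur
    return Q
```

On small toy examples the greedy path after training matches the exhaustive optimum; with ε > 0, training stops once the predicted E2E rate stabilizes across consecutive episodes.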

Simulation Results
We simulate a multihop network with M clusters, each containing K relays. The average E2E rate is calculated as the performance metric, and the simulation parameters are given in Table 3. We first examine the convergence of the Q-learning-based relay selection; then, the scheme is compared with the optimal scheme given by (4) and the conventional decentralized scheme described by (5) and (6).
First, Figures 2 and 3 show the convergence of the proposed Q-learning scheme with respect to K and M. Figure 2 shows that more iterations are needed for convergence as K increases. When K is fixed, Figure 3 shows that M has little impact on the number of iterations needed for convergence, which means that the proposed scheme is applicable to long routes.

Figure 4 shows the average E2E rates of the three schemes with respect to K. The first observation is that the curve of the Q-learning scheme lies very close to that of the optimal scheme, especially when K is small. The gap grows as K increases, which indicates that the Q-learning scheme cannot consistently benefit from a growing K. Without the rectified update function, the Q-learning scheme achieves the lowest E2E rate and cannot benefit from growing K at all. Another important observation is that for K ≤ 19, the Q-learning scheme clearly outperforms the conventional decentralized scheme; beyond K = 19, the conventional decentralized scheme yields better performance. This is because a large K yields a large action space, in which Q-learning does not work well. The issue is easily avoided in practice: a larger K also increases computational complexity, so the cluster size should normally be kept moderate.

Figure 5 illustrates the average E2E rates with respect to M. As M grows, the average E2E rates of all three schemes decrease, because the E2E rate is bounded by the worst hop. We also observe that the curve of the Q-learning scheme stays very close to that of the optimal scheme, and that the Q-learning scheme clearly outperforms the conventional decentralized scheme; this advantage remains unchanged as M increases. From the above figures, we conclude that the proposed Q-learning scheme achieves a near-optimal E2E rate and is best applied to multihop linear networks with moderate cluster sizes.

Computational Complexity
The optimal policy selects the path with the maximum E2E rate among all K^M paths, leading to a computational complexity of O(K^M). In the proposed Q-learning scheme, the complexities of initialization and prediction are O(K²) and O(K), respectively. The main part of the training phase is the iterative updating of the Q-tables, leading to a complexity of O(MK log(1/ε)). Hence, in most practical networks, the Q-learning scheme has far lower complexity than the optimal scheme.

CSI Amount
The optimal scheme is centralized and requires the CSI of all (M − 1)K² + 2K inter-cluster links to be reported to the central controller. In the proposed Q-learning scheme, the CSI of the (M − 1)K² + 2K inter-cluster links is estimated and collected only locally, between adjacent clusters; thus, the total energy consumed by signaling and the interference it causes to other transmissions are greatly reduced. In each iterative update, a cluster requires only two values, Q_max(m + 1) and r_{m+1}, from the next cluster to update its Q-table, which is a small additional communication cost. Moreover, each cluster transmits the selected action a_m to the next cluster, costing only log₂ K bits.
The signaling overhead of multihop networks is mainly due to CSI collection. To evaluate the CSI amount, we propose a counting method that takes into account both the number of channels to be estimated and the length of the route over which the CSI is transmitted. We suppose that the central controller is located at the middle cluster of the multihop path. In the optimal scheme, the CSI of faraway clusters must be delivered to the central controller via multihop transmission; we therefore compute a weighted sum of the CSI amount, where the weight of each CSI block is the number of hops needed to collect it. The Q-learning-based scheme and the conventional decentralized scheme only require CSI transmission between adjacent clusters, so all their weights are 1. The weighted CSI amount of the optimal scheme, C(M, K), is calculated by (13). It is not difficult to show that C(M, K) is always greater than (M − 1)K² + 2K, the CSI amount of the Q-learning-based scheme, for all values of M and K. The three schemes are compared in Table 4.
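To make the counting concrete, the sketch below (illustrative Python) computes a weighted CSI amount under our own assumptions: the controller sits at cluster ⌈M/2⌉, the CSI of hop m (estimated at the receiving cluster m + 1) is weighted by its hop distance to the controller, floored at 1, and the first and last hops contribute K links each while the inner hops contribute K² each. This is one instantiation of the counting method described above, not necessarily identical to (13) term by term.

```python
def csi_decentralized(M, K):
    # All weights equal 1: M - 1 inner hops of K*K links plus the S and D hops
    return (M - 1) * K * K + 2 * K

def csi_optimal_weighted(M, K):
    # Weighted sum for the centralized scheme (assumed controller position)
    c = (M + 1) // 2  # assumption: controller at the middle cluster
    total = 0
    for m in range(M + 1):  # hop m runs from cluster m to cluster m + 1
        links = K * K if 0 < m < M else K
        weight = max(1, abs(m + 1 - c))  # hops from estimator to controller
        total += weight * links
    return total
```

For example, csi_optimal_weighted(4, 5) = 120 exceeds csi_decentralized(4, 5) = 85, consistent with C(M, K) > (M − 1)K² + 2K.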

Conclusions
In this paper, we have proposed a decentralized Q-learning-based relay selection scheme for multihop clustered networks. The scheme is composed of three phases: initialization, training, and prediction. A new update function for the Q-values is designed to improve prediction performance. Simulation results show that the proposed Q-learning scheme achieves near-optimal performance and outperforms the conventional decentralized scheme in terms of average E2E rate. Further advantages of the Q-learning scheme are its lower computational complexity and its smaller cost for collecting CSI.

Conflicts of Interest:
The authors declare no conflict of interest.