The major components of the proposed Adaptive HARQ (A-HARQ) Q-learning model are the agent, the state, the action, and the reward calculator. The agent represents the 5G NR system, which performs an action according to the Q-Table obtained by reinforcement learning. The action here refers to when and how many times to transmit or retransmit a packet. The transmission results affect the system state, and the resulting feedback determines the reward computed by the reward calculator, which in turn drives the optimization of the Q-Table. The objective of the proposed Q-learning method is to identify a policy that maximizes the overall cumulative reward. The following is a detailed description of the model components.
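To make this agent–environment loop concrete, the following is a minimal Q-learning sketch in Python. The hyperparameters (learning rate, discount factor, exploration rate) and the epsilon-greedy policy are illustrative assumptions, not values specified in this study.

```python
import random
from collections import defaultdict

# Hypothetical hyperparameters; the paper does not fix these values here.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = list(range(9))  # A0 (5G HARQ) through A8 (alternative schemes)

# Q-Table: maps a discrete state to one Q-value per action.
q_table = defaultdict(lambda: [0.0] * len(ACTIONS))

def choose_action(state):
    """Epsilon-greedy action selection over the Q-Table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[state][a])

def update(state, action, reward, next_state):
    """Standard one-step Q-learning update of the Q-Table."""
    best_next = max(q_table[next_state])
    td_target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])
```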
4.1.2. State Space
The design of the state space determines whether the algorithm can converge and affects the analysis and design of the reward. We categorize the data transmitted over 5G NR-V2X by BLER, average CQI, UE distance (Dis), and NumRBs.
As is clear from the previous introduction to the HARQ process, the receiver decoder's processing delay has a great impact on the transmission RTT. More returned NACKs mean that more erroneous data were transmitted; it also means that the more retransmissions are needed, the greater the accumulated decoding delay at the receiver. Therefore, reducing the BLER can enhance the reliability of information transmission and reduce the delay. The BLER is the ratio of erroneously received transport blocks to the total number of transmitted transport blocks, calculated as in Equation (2):

$\mathrm{BLER} = N_{err} / N_{total} \quad (2)$

where $N_{err}$ is the number of erroneously received transport blocks and $N_{total}$ is the total number of transmitted transport blocks.
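As a sketch of Equation (2) in code, assuming the standard block-error-rate definition above:

```python
def bler(num_error_blocks: int, num_total_blocks: int) -> float:
    """Block error rate: erroneous transport blocks / total transport blocks."""
    return num_error_blocks / num_total_blocks

# Example: 5 erroneous blocks out of 100 transmitted gives BLER = 0.05.
assert bler(5, 100) == 0.05
```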
In addition to the important index of the downlink BLER, the design of the state space also takes the average downlink Channel Quality Indicator (CQI), the UE distance (Dis), and the number of resource blocks (NumRBs) into account. These are all discrete and relatively independent data, which together constitute the state space of the Q-learning model.
The UE distance (Dis) ranges from 0 to 2000 m, and the NumRBs value ranges from 0 to 120. The UE distance is divided into one section every 100 m, for a total of 20 sections, and the number of resource blocks is divided into a total of 20 intervals. The intervals of the UE distance (Dis) and the resource block numbers (NumRBs) are arranged and combined, and the two state elements of the downlink BLER and the downlink average CQI are included to jointly establish the state space. Each divided state represents a class of transmission scenarios, and together they cover all possible transmission scenarios. The state-space table is shown in Table 1.
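A minimal sketch of how such a state could be discretized is given below. The distance bins follow the 100 m sections described above; the resource-block bin width, the BLER binning, and the CQI range are illustrative assumptions (the paper's exact quantization is given in Table 1, not reproduced here).

```python
def discretize_state(bler: float, avg_cqi: int, distance_m: float, num_rbs: int):
    """Map raw measurements to a discrete state tuple for the Q-Table."""
    dist_bin = min(int(distance_m // 100), 19)  # 0-2000 m in 100 m sections -> 20 bins
    rb_bin = min(int(num_rbs // 6), 19)         # 0-120 RBs -> 20 intervals (assumed equal width)
    bler_bin = 0 if bler <= 0.1 else 1          # assumed binning: meets / violates a BLER target
    cqi_bin = avg_cqi                           # CQI index is already discrete (assumed 0-15)
    return (bler_bin, cqi_bin, dist_bin, rb_bin)
```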
4.1.3. Action Space
A simple and efficient action-space design can reduce the difficulty of convergence and improve the training speed. The designed action space in this study is $A = \{A0, A1, \ldots, A8\}$, giving a total of nine discrete actions, where A0 is the 5G HARQ transmission scheme shown in Figure 1, A1 and A2 are T-delay schemes, A3 and A4 are K-repetition schemes, and A5–A8 are [T, K]-overlap schemes. Details are given in Table 2, where A1–A8 correspond to the eight alternative HARQ transmission schemes shown in Figure 3.
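The action set can be written down directly. The delay values for A1 and A2 and the repetition counts for A3 and A4 follow the scheme descriptions in this section; the exact [T, K] pairings of A5–A8 are assumptions here, as the authoritative mapping is in Table 2.

```python
from enum import Enum

class Action(Enum):
    """Nine discrete actions, encoded as (T slots of delay, K total transmissions)."""
    A0 = (0, 1)  # standard 5G HARQ (Figure 1)
    A1 = (1, 1)  # T-delay: delay 1 slot
    A2 = (2, 1)  # T-delay: delay 2 slots
    A3 = (0, 2)  # K-repetition: 2 repetitions
    A4 = (0, 3)  # K-repetition: 3 repetitions
    A5 = (1, 2)  # [T, K]-overlap (pairing assumed)
    A6 = (1, 3)  # [T, K]-overlap (pairing assumed)
    A7 = (2, 2)  # [T, K]-overlap (pairing assumed)
    A8 = (2, 3)  # [T, K]-overlap (pairing assumed)
```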
Although a larger number of retransmissions increases the success rate of packet transmission, the time spent waiting for the receiver's NACK and then retransmitting greatly affects the delay. Therefore, to save the time and resources required for retransmission, we limit the K-repetition scheme to at most three transmissions, that is, one normal transmission plus two consecutive retransmissions. Correctly selecting among the designed optimized transmission schemes avoids both the waste of redundant channel resources when the channel quality is good and the long delay and high BLER caused by repeated retransmission when the channel quality is poor.
For example, when the CQI is poor, the probability of transmission failure is large. In this case, the information can be sent with the T-delay scheme (delaying 1 or 2 slots) to improve the success rate of the transmission. Because an appropriate delay can avoid an interval with poor CQI, there is a greater probability of reducing the average BLER while simultaneously releasing limited channel resources. It is worth noting that although the T-delay scheme appears to waste time temporarily, the average number of retransmissions is in fact reduced because interference may be avoided; consequently, both the overall transmission time and the average delay decrease.
When the user is far away, there is a high probability of transmission failure. In this case, the information can be sent with the K-repetition scheme (2 or 3 repetitions) to improve the success rate of the transmission and reduce the average transmission delay. An appropriate repetition mechanism saves the waiting time that a conventional retransmission would spend after the receiver returns a NACK, which reduces the average BLER. It is worth noting that continuous repetition temporarily borrows channel resources; however, because the transmission success rate improves, the number of retransmissions that would otherwise be required is greatly reduced, which compensates for the borrowed channel resources.
Moreover, when the distance is far and the CQI is poor at the same time, the [T, K]-overlap scheme, in which repetitions are sent after a delay, can be selected. The selection of a transmission scheme cannot focus on a single parameter; it must comprehensively consider factors such as the CQI, the UE distance, and the number of resource blocks. The 5G HARQ scheme and the eight designed alternatives together constitute the action space of the A-HARQ scheme in this study.
4.1.4. Reward Calculator
The objective of the A-HARQ mechanism is to select the appropriate transmission scheme for each packet so as to reduce the overall number of retransmissions in the 5G NR-V2X physical layer. Q-learning in the A-HARQ employs a learning model to choose the optimal scheme that reduces the average BLER across various transmission environments.
Based on our design, an alternative transmission scheme (A1–A8) is used when a transmission fails to meet the BLER requirement, while the original 5G HARQ scheme (A0) is used when the requirement is met. Therefore, a maximum tolerable BLER threshold, denoted as $BLER_{th}$, is set initially and can be adjusted as necessary; in our simulation environment, it is fixed in advance.
The original 5G HARQ BLER is denoted as $BLER_{HARQ}$, and the proposed adaptive HARQ BLER is denoted as $BLER_{A\text{-}HARQ}$. The design of the Reward Calculator is mainly based on whether the BLER exceeds $BLER_{th}$.
Depending on whether $BLER_{HARQ}$ is greater than $BLER_{th}$, one of two reward rules, $R_1$ and $R_2$, is selected in the Reward Calculator; they are set as follows:
The reward rule $R_1$, corresponding to a state that meets the BLER requirement, i.e., $BLER_{HARQ} \le BLER_{th}$, is defined as in Equation (3).
In this case, choosing the 5G HARQ scheme (A0) will result in a positive reward, while choosing other optimized transmission schemes (A1–A8) will result in a punishment.
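A minimal sketch of this rule follows, with a unit reward magnitude `R_POS` as an assumed placeholder; Equation (3) fixes the actual constants.

```python
R_POS = 1.0  # assumed placeholder magnitude; Equation (3) defines the real values

def reward_rule_r1(action_index: int) -> float:
    """R1: BLER requirement met -- reward A0, punish the alternatives A1-A8."""
    return R_POS if action_index == 0 else -R_POS
```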
The reward rule $R_2$, corresponding to a state that does not meet the BLER requirement, i.e., $BLER_{HARQ} > BLER_{th}$, is defined as in Equation (4):

$R_2 = R_{base} + R_{rein} \quad (4)$
In the case of an unsatisfied BLER threshold, the real change in BLER ($BLER_{HARQ} - BLER_{A\text{-}HARQ}$) after execution of the A-HARQ decision is the main basis, and factors such as distance, CQI, and the number of RBs are also considered. In this case, selecting the 5G HARQ scheme (A0) is punished, whereas selecting one of the other optimized transmission schemes (A1–A8) is rewarded or punished according to the key factor of the BLER change. The reward rule $R_2$ consists of the base reward $R_{base}$ and the reinforcement reward $R_{rein}$.
The base reward $R_{base}$, reflecting the BLER reduction, can be expressed as Equation (5):

$R_{base} = \tau \,(BLER_{HARQ} - BLER_{A\text{-}HARQ}) \quad (5)$

where $\tau$ is the reward coefficient, set at $\tau = 100$. The smaller the $BLER_{A\text{-}HARQ}$ after the A-HARQ scheme, the greater the reward $R_{base}$.
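In code, the base reward of Equation (5) is a single scaled difference, with $\tau = 100$ as stated above:

```python
TAU = 100  # reward coefficient tau from the paper

def base_reward(bler_harq: float, bler_a_harq: float) -> float:
    """R_base = tau * (BLER_HARQ - BLER_A-HARQ): larger when A-HARQ lowers BLER more."""
    return TAU * (bler_harq - bler_a_harq)
```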
If the transmission scheme only pursues the goal of reducing the BLER while ignoring the quality of the current network environment and the available resources, limited time and channel resources will be wasted. Therefore, in formulating the reward function, in addition to reducing the BLER, we also comprehensively consider the channel quality, the UE distance, and the available resource blocks. Only in this way can transmission slots and limited channel resources be allocated reasonably, the total transmission time be reduced, and the reliability of data transmission be improved. Therefore, the reinforcement reward $R_{rein}$ is set to help the agent find the optimal strategy and is defined as in Equation (6):

$R_{rein} = R_{CQI} + R_{Dis} + R_{RB} \quad (6)$

where $R_{CQI}$, $R_{Dis}$, and $R_{RB}$ are the reinforcement reward values corresponding to the CQI, the UE distance, and the available resource blocks, respectively. Each comprises a reward coefficient and a normalized term, as shown in Table 3.
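A sketch of Equation (6) follows; the per-action coefficients and the normalizations live in Table 3, so the default weights below are illustrative placeholders only.

```python
def reinforcement_reward(norm_cqi: float, norm_dist: float, norm_rbs: float,
                         w_cqi: float = 1.0, w_dist: float = 1.0,
                         w_rbs: float = 1.0) -> float:
    """R_rein = R_CQI + R_Dis + R_RB, each a coefficient times a normalized input.

    The weights w_* depend on the chosen action (Table 3); the defaults here
    are placeholders, not values from the paper.
    """
    return w_cqi * norm_cqi + w_dist * norm_dist + w_rbs * norm_rbs
```

When the threshold is violated, the full rule of Equation (4) is then simply `base_reward(...) + reinforcement_reward(...)`; when it is met, the rule $R_1$ sketched earlier applies instead.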
Across the eight optimized transmission schemes of A-HARQ, the weights given to the CQI, the distance, and the available resource blocks differ. For example, the channel quality is weighted more heavily when a delay-transmission action is chosen, whereas the transmission distance is weighted more heavily when a repetition-transmission action is chosen; accordingly, the coefficient of the dominant factor is the largest. Likewise, the channel quality is the key consideration in deciding the length of the delay, and the distance is the key consideration in deciding the number of repeated transmissions. At the same time, the number of available resource blocks is a further factor in deciding whether to postpone the sending time and/or to reduce the number of repeated transmissions.
Specifically, for the CQI factor, the worse the CQI when the T-delay scheme is selected, the larger the reward obtained, while the better the CQI when the K-repetition scheme is selected, the larger the reward obtained. The opposite holds for the distance and resource-block factors: the greater the distance and the number of resource blocks, the greater the reward for repeated transmission, while the smaller the distance and the number of resource blocks, the greater the reward for delayed transmission.