This section presents the proposed Q-learning AODV (QL-AODV) routing protocol for AAV networks. It begins by describing the AAV network model and then summarizes the main features of the baseline AODV protocol. Next, the detailed design of QL-AODV is introduced, including the extended packet structure, the core components of the Q-learning algorithm, the overall operational workflow, and the specific algorithms employed.
3.1. AAV Network Model
The network under consideration consists of a set of AAVs, NAAV, operating within a defined three-dimensional space to conduct cooperative tasks such as surveillance, data collection, or communication support. These AAVs form a flying ad hoc network in which every vehicle is self-organizing and self-configuring and functions as a network router. The AAV topology is highly dynamic owing to continuous movement, which causes rapid variations in inter-node connectivity. Each AAV is equipped with a wireless interface and can communicate with its peers (AAV-to-AAV, U2U) and with the base station (AAV-to-base station, U2B) within a given transmission range. Every AAV maintains a buffer for packets awaiting forwarding. The network operates in a 6G context that promises wide bandwidth, low latency, and native AI integration, and can interact with fog-computing nodes to enhance processing capacity and reduce latency for delay-sensitive applications.
The AAV network is modeled as a time-dependent undirected graph G(t) = (V(t), E(t)), where V(t) represents the set of AAVs and base stations at time t, and E(t) denotes the set of wireless links between nodes that lie within mutual transmission range. The proposed QL-AODV algorithm is built on the AODV routing framework and incorporates RL via Q-learning to optimize path selection. The principal enhancements include an extended RREP packet that conveys additional network-state information, the definition of the Q-learning components, and modifications to the route-selection procedure.
Figure 2 presents the extended RREP packet format for the proposed QL-AODV. To provide visibility into potential congestion along a route, we extend the AODV RREP structure. Based on the implementation in aodv-packet.h (class RrepHeader) and aodv-packet.cc, two new fields are added.
- (1) uint32_t m_totalBuffer: stores the total buffer occupancy aggregated over all nodes from the RREP originator (destination or intermediate) back to the source. The unit may be the number of packets or a percentage; in the proposed implementation (aodv-routing-protocol.cc, function GetBufferOccupancy), it represents queue occupancy as a percentage.
- (2) uint32_t m_maxBuffer: stores the maximum buffer occupancy observed at any node along the route.
When a node (destination or intermediate) generates an RREP, it initializes these fields with its own buffer information. As the RREP propagates back to the source, each intermediate node updates the fields as follows.
localBuffer = GetBufferOccupancy()
RREP.total_buffer = RREP.total_buffer + localBuffer
RREP.max_buffer = MAX(RREP.max_buffer, localBuffer)
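The same aggregation can be expressed in C++. The following is a minimal sketch in which RrepHeaderExt stands in for the extended RrepHeader and only the two added fields are shown; it is illustrative, not the exact implementation.

```cpp
// Minimal sketch of per-hop aggregation of the extended RREP fields.
// RrepHeaderExt is an illustrative stand-in for the extended RrepHeader.
#include <algorithm>
#include <cstdint>

struct RrepHeaderExt
{
  uint32_t m_totalBuffer = 0; // sum of per-node occupancy (%)
  uint32_t m_maxBuffer = 0;   // maximum per-node occupancy (%)
};

// Called by each intermediate node while forwarding the RREP back
// towards the source; localBuffer is the node's own occupancy in %.
void
AggregateBufferInfo (RrepHeaderExt &rrep, uint32_t localBuffer)
{
  rrep.m_totalBuffer += localBuffer;
  rrep.m_maxBuffer = std::max (rrep.m_maxBuffer, localBuffer);
}
```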
To enable buffer-aware routing decisions, we extend the standard AODV RREP message format from 19 bytes to 27 bytes, i.e., the new RREP is eight bytes larger than its legacy AODV counterpart. When the source node receives an RREP, it therefore holds aggregated buffer information for the entire route, enabling more intelligent path selection. To quantify congestion for QL-AODV routing decisions, we measure the occupancy of the AODV routing queue (RequestQueue) using a sliding time window. Each node updates two windowed statistics in the background, expressed as a percentage of queue capacity: the windowed mean and the windowed maximum. When an RREP traverses a node, the node reads these pre-computed values and writes them into the header fields for aggregation along the path. At the source, the path mean and path maximum are normalized to [0, 1] (avgBufNorm and maxBufNorm) and used as the Q-learning state. This design suppresses instantaneous noise from micro-bursts, preserves sensitivity to short congestion peaks, and adds no control-plane latency, since all computations run periodically in the background.
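A sliding-window tracker of this kind can be sketched as follows; the class name, window size, and sampling period are assumptions of the sketch, not the exact implementation.

```cpp
// Illustrative sliding-window tracker for RequestQueue occupancy.
// Samples are stored as percentages of queue capacity; the window
// length and sampling period are assumed parameters.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <numeric>

class BufferWindowStats
{
public:
  explicit BufferWindowStats (std::size_t windowSize) : m_windowSize (windowSize) {}

  // Called periodically in the background with the current occupancy in %.
  void AddSample (uint32_t occupancyPercent)
  {
    m_samples.push_back (occupancyPercent);
    if (m_samples.size () > m_windowSize)
      {
        m_samples.pop_front ();
      }
  }

  // Windowed mean, read when an RREP passes and aggregated into m_totalBuffer.
  uint32_t WindowedMean () const
  {
    if (m_samples.empty ()) return 0;
    uint64_t sum = std::accumulate (m_samples.begin (), m_samples.end (), uint64_t{0});
    return static_cast<uint32_t> (sum / m_samples.size ());
  }

  // Windowed maximum, used to update m_maxBuffer.
  uint32_t WindowedMax () const
  {
    if (m_samples.empty ()) return 0;
    return *std::max_element (m_samples.begin (), m_samples.end ());
  }

private:
  std::size_t m_windowSize;
  std::deque<uint32_t> m_samples;
};
```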
Our QL-AODV implementation monitors the AODV routing buffer (RequestQueue) rather than the MAC-layer transmission buffer. The AODV routing buffer stores packets awaiting route discovery at the network layer, while the MAC buffer stores packets with established routes awaiting channel access. We focus on AODV buffer occupancy because (1) it directly reflects routing-protocol performance issues, (2) its congestion triggers costly route rediscovery in high-mobility AAV networks, and (3) it provides actionable information for routing optimization decisions. The buffer occupancy fields in our extended RREP format (m_totalBuffer and m_maxBuffer) specifically measure the percentage occupancy of the AODV routing queues, enabling intelligent congestion-aware route selection.
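As an illustration, the percentage reported by GetBufferOccupancy can be derived from the RequestQueue size and its configured capacity. The free function below is a simplified sketch that takes those two values as parameters rather than reproducing the actual implementation.

```cpp
// Simplified sketch of GetBufferOccupancy(): occupancy of the AODV
// RequestQueue expressed as a percentage of its configured capacity.
double
GetBufferOccupancy (uint32_t queueSize, uint32_t maxQueueLen)
{
  if (maxQueueLen == 0)
    {
      return 0.0; // avoid division by zero for an unconfigured queue
    }
  return 100.0 * static_cast<double> (queueSize) / static_cast<double> (maxQueueLen);
}
```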
3.2. Q-Learning Components
Q-learning is a model-free, off-policy RL algorithm that seeks to learn an action-value function, denoted Q(s, a), estimating the expected cumulative reward obtained by performing action a in state s. Within the proposed QL-AODV framework, the main Q-learning components are defined as follows.
- (a) State space (δ)
A state characterizes the quality of a potential route from the source to the destination, constructed from information carried in the RREPs received at the source. Specifically, s is a normalized three-tuple:

$$s = (h_{norm},\; b_{avg\_norm},\; b_{max\_norm})$$

where $h_{norm}$ is the normalized hop count, $b_{avg\_norm}$ is the normalized average buffer occupancy along the route, and $b_{max\_norm}$ is the normalized maximum buffer occupancy of any node on the route.
The normalized hop count ($h_{norm}$) is the hop count h to the destination divided by a predefined normalization constant $h_{max}$:

$$h_{norm} = \frac{h}{h_{max}}$$

The normalized average buffer occupancy along the route ($b_{avg\_norm}$) is obtained by dividing the cumulative buffer occupancy $B_{total}$ (field m_totalBuffer in the RREP, i.e., the sum of per-node occupancy percentages) by the hop count h and normalizing by the maximum node buffer occupancy $B_{node\_max\%}$:

$$b_{avg\_norm} = \frac{B_{total}/h}{B_{node\_max\%}}$$

The normalized maximum buffer occupancy of any node on the route ($b_{max\_norm}$) is given as follows:

$$b_{max\_norm} = \frac{B_{max\_val}}{B_{node\_max\%}}$$

where $B_{max\_val}$ is taken from the m_maxBuffer field of the RREP. Normalization confines all state components to [0, 1], placing them on a common scale and thereby enhancing learning stability and efficiency.
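As an illustration, the state computation above can be sketched as follows; the function signature, the value of the hop normalization constant, and the clamping to 1.0 are assumptions of the sketch, not the exact CalculateState implementation.

```cpp
// Illustrative computation of the normalized state tuple
// (hNorm, avgBufNorm, maxBufNorm) from RREP-carried route information.
// hMax (hop normalization constant) and the clamping are assumptions.
#include <algorithm>
#include <cstdint>

struct QState
{
  double hNorm;      // normalized hop count
  double avgBufNorm; // normalized average buffer occupancy along the route
  double maxBufNorm; // normalized maximum buffer occupancy on the route
};

QState
CalculateState (uint32_t hopCount, uint32_t totalBuffer, uint32_t maxBuffer,
                double hMax = 10.0, double nodeMaxPercent = 100.0)
{
  QState s;
  double h = static_cast<double> (std::max<uint32_t> (hopCount, 1)); // avoid division by zero
  s.hNorm = std::min (static_cast<double> (hopCount) / hMax, 1.0);
  s.avgBufNorm = std::min ((static_cast<double> (totalBuffer) / h) / nodeMaxPercent, 1.0);
  s.maxBufNorm = std::min (static_cast<double> (maxBuffer) / nodeMaxPercent, 1.0);
  return s;
}
```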
- (b) Action space (A)
If the source collects multiple RREPs for the same destination during the collection interval (m_rrepWaitTime), it forms a set of candidate routes $R = \{r_0, r_1, \ldots, r_{N-1}\}$ with $N \le$ MAX_ROUTES. Each route is identified by an index routeId in $\{0, \ldots, N-1\}$. An action a corresponds to selecting a route $r_a$. For data forwarding, the action space is therefore $A = \{0, 1, \ldots, N-1\}$.
- (c) Reward function (R)
The reward R quantifies the immediate effectiveness of choosing an action a (route $r_a$) in the state s. The proposed QL-AODV employs a simple binary reward:

$$R = \begin{cases} +1, & \text{successful transmission} \\ -1, & \text{failed transmission} \end{cases}$$
Determining “success” or “failure” is crucial. In hop-by-hop AODV routing, success can be defined as receiving a MAC-layer acknowledgement (IEEE 802.11 MAC ACK) for a packet forwarded to the next hop. Conversely, failure is recorded when the MAC layer reports a transmission error, or no ACK is received within a specified timeout. This binary outcome supplies the Boolean success input to the UpdateQValue function in the implementation.
- (d) Q-value update rule
The action-value Q(s, a), denoting the expected cumulative reward obtained by taking action a in state s and thereafter following the optimal policy, is iteratively updated through the Bellman equation:

$$Q_{k+1}(s, a) = Q_k(s, a) + \alpha \left[ R + \gamma \max_{a'} Q_k(s', a') - Q_k(s, a) \right]$$

where k is the learning-iteration index, α is the learning rate, γ is the discount factor, s is the current system state before choosing an action a, s′ is the next system state after the action a is executed, and $\max_{a'} Q_k(s', a')$ is the maximum estimated Q-value in state s′, obtained by selecting the optimal action a′.
In the proposed implementation, the term $\max_{a'} Q(s', a')$ (maxNextQ) is computed by scanning the largest Q-values of all feasible actions from the candidate states associated with the destination under consideration. The Q-table, which stores the values Q(s, a), is implemented as a nested std::map: the outer key is the destination's IP address, the inner key is a QState object (hashed via QStateHash for use in std::unordered_map), and the final value is a std::vector<double> in which each element corresponds to the Q-value of an action (a routeId) available from that state.
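For illustration, the nested container described above can be sketched as follows. QState mirrors the state tuple of Section 3.2; the hash combination, the use of a plain uint32_t for the destination address, and the default of 0 for missing entries are simplifications of the sketch, not the exact implementation.

```cpp
// Illustrative sketch of the Q-table layout described above:
// destination IP -> (state -> per-action Q-values indexed by routeId).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <unordered_map>
#include <vector>

struct QState
{
  double hNorm;      // normalized hop count
  double avgBufNorm; // normalized average buffer occupancy
  double maxBufNorm; // normalized maximum buffer occupancy

  bool operator== (const QState &o) const
  {
    return hNorm == o.hNorm && avgBufNorm == o.avgBufNorm && maxBufNorm == o.maxBufNorm;
  }
};

struct QStateHash
{
  std::size_t operator() (const QState &s) const
  {
    std::size_t h1 = std::hash<double> () (s.hNorm);
    std::size_t h2 = std::hash<double> () (s.avgBufNorm);
    std::size_t h3 = std::hash<double> () (s.maxBufNorm);
    return h1 ^ (h2 << 1) ^ (h3 << 2); // simple hash combination
  }
};

// Outer key: destination address (IPv4 as uint32_t in this sketch);
// inner key: QState; value: Q-value per candidate routeId.
using QTable = std::map<uint32_t,
                        std::unordered_map<QState, std::vector<double>, QStateHash>>;

// maxNextQ: largest Q-value over all feasible actions of the candidate
// states associated with a given destination (0 if none are recorded).
double
MaxNextQ (const QTable &table, uint32_t dst)
{
  double best = 0.0;
  auto it = table.find (dst);
  if (it == table.end ()) return best;
  for (const auto &statePair : it->second)
    {
      for (double q : statePair.second)
        {
          best = std::max (best, q);
        }
    }
  return best;
}
```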
3.3. Operational Process of the QL-AODV
- A. Phase 1—Route Discovery and Information Collection
The first phase of the proposed QL-AODV algorithm begins when a source node (S) has data to transmit to a destination node (D) but no valid route to it. In that case, S broadcasts a route request (RREQ) exactly as in the regular AODV protocol. The RREQ propagation searches for current paths to the destination by traversing the network. Whenever the RREQ packet is forwarded from one intermediate node to another, the receiving node constructs a fresh reverse route to the source or updates an existing one. If an intermediate node possesses a valid and sufficiently fresh route to the destination D (with the "Destination Only" flag cleared), or if the node is the destination itself, it replies by sending an RREP packet.
At this stage, the RREP originator node, denoted Nrrep, initiates buffer awareness by placing its own buffer usage statistics in the RREP header. Specifically, it initializes m_totalBuffer and m_maxBuffer with its current buffer state. As the RREP is forwarded back towards the source node along the reverse path, each forwarding intermediate node (Nint) contributes to the buffer data aggregation: it retrieves its local buffer occupancy, localBuffer = N_int.GetBufferOccupancy(), and updates the RREP fields accordingly. The total buffer field (RREP.m_totalBuffer) is incremented by localBuffer, and the maximum buffer field (RREP.m_maxBuffer) is updated to the largest value observed so far.
Finally, upon receiving RREPs over various paths, the source node S waits for a fixed amount of time, m_rrepWaitTime (typically 300 milliseconds), to accumulate multiple RREPs concerning destination D. Each received RREP is stored, along with its sender, reception time, and related metadata, in an ordered buffer m_rrepBuffer[D]. This collection procedure is handled by the RecvReply and ProcessCollectedRreps routines, which parse and organize the collected route replies. This buffer-aware route discovery procedure makes the source node aware not only of the connectivity state but also of the congestion levels along potential paths, enabling intelligent and efficient route selection in the subsequent steps of the algorithm.
Figure 3 presents the flowchart of the proposed buffer-aware RREP processing approach. Furthermore, Algorithm 1 provides the pseudo-code of the proposed buffer-aware RREP processing.
| Algorithm 1: Buffer-aware RREP processing (Handle incoming RREP) |
- B. Phase 2—Q-Learning-Based Route Selection
Figure 4 presents the main steps of the proposed Q-learning-based route selection scheme, and Algorithm 2 provides the pseudo-code of the proposed route selection model. Once the first phase's route discovery and buffer information collection are complete and m_rrepWaitTime has elapsed, the source node (S) starts the decision-making process by calling the ProcessCollectedRreps function. This marks the beginning of the second phase, in which RL in the form of Q-learning is employed to make an informed selection of the optimal route from multiple candidates. For each RREP received and queued in m_rrepBuffer[D], the source node computes a state vector describing the path quality. The vector s is calculated by calling the CalculateState function; its components are normalized values representing the hop distance to the destination, the path's average buffer usage, and the maximum buffer occupancy observed along the path, respectively.
Normalization makes all the metrics comparable on the same range, which is crucial for effective decision-making and learning. A list of at most MAX_ROUTES (typically 10) candidate routes and their parameters is then selected and stored in a dedicated data structure, m_multiRoutes[D]. Each entry holds essential metadata such as the nextHop, hopCount, sequence number (seqNo), totalBuffer, maxBuffer, route lifetime, and a unique routeId. These candidate routes are then ready for Q-learning evaluation.
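For reference, a candidate-route record of the kind stored in m_multiRoutes[D] might look as follows; the struct name and field types are assumptions of the sketch, not the exact implementation.

```cpp
// Illustrative candidate-route entry stored per destination in
// m_multiRoutes[D]; field names follow the metadata listed above.
#include <cstdint>
#include <map>
#include <vector>

struct RouteCandidate
{
  uint32_t routeId;     // unique index of this candidate (the action id)
  uint32_t nextHop;     // next-hop address towards the destination
  uint32_t hopCount;    // number of hops to the destination
  uint32_t seqNo;       // destination sequence number from the RREP
  uint32_t totalBuffer; // aggregated buffer occupancy along the path (%)
  uint32_t maxBuffer;   // maximum per-node buffer occupancy on the path (%)
  double lifetime;      // remaining route lifetime (s)
};

// Destination address -> candidate routes collected for it (at most MAX_ROUTES).
std::map<uint32_t, std::vector<RouteCandidate>> g_multiRoutes;
```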
In the Q-learning model, the source node maintains a Q-table (m_qTable) that maps state-action pairs to Q-values, i.e., the expected reward of using a specific path under a given network state. If a newly computed (state, action) pair is not yet in the Q-table, it is initialized with a default Q-value, typically zero or a small random number. This initialization allows the learning algorithm to begin assigning rewards for the success or failure of subsequent packet transmissions. The next step is selecting a route (action) from the candidate set based on an ε-greedy policy, which balances exploration and exploitation. With probability ε (initially 0.5), the algorithm selects a random routeId from the available candidates in m_multiRoutes[D], encouraging exploration of diverse paths.
With the remaining probability 1 − ε, the algorithm chooses the routeId with the highest current Q-value Q(s, a), thereby exploiting routes known to perform well. The value of ε is decayed by a factor (m_epsilonDecay = 0.995) at each decision step until it reaches a minimum (m_epsilonMin = 0.1), increasingly favoring exploitation as learning advances. Once a route is selected, referred to as selectedRoute, its information is installed in the main routing table (m_routingTable). The state vector and the selected routeId are also saved in the corresponding RoutingTableEntry; this information is required later to update the Q-value according to the observed outcome.
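The ε-greedy selection with decay described above can be sketched as follows; the random-number handling, the function signature, and the per-routeId vector of Q-values are assumptions of the sketch.

```cpp
// Illustrative epsilon-greedy route selection with epsilon decay, using
// the parameter values quoted in the text (0.5 initial, 0.995 decay,
// 0.1 minimum). Candidate Q-values are indexed by routeId.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <random>
#include <vector>

uint32_t
SelectRouteEpsilonGreedy (const std::vector<double> &qValues, // one entry per routeId
                          double &epsilon,                    // decayed in place
                          std::mt19937 &rng)
{
  // The caller guarantees at least one candidate route is available.
  std::uniform_real_distribution<double> coin (0.0, 1.0);
  uint32_t routeId;

  if (coin (rng) < epsilon)
    {
      // Exploration: pick a random candidate route.
      std::uniform_int_distribution<uint32_t> pick (0, static_cast<uint32_t> (qValues.size ()) - 1);
      routeId = pick (rng);
    }
  else
    {
      // Exploitation: pick the route with the largest Q-value.
      routeId = static_cast<uint32_t> (
          std::distance (qValues.begin (),
                         std::max_element (qValues.begin (), qValues.end ())));
    }

  // Decay epsilon towards its minimum after each decision.
  const double epsilonDecay = 0.995;
  const double epsilonMin = 0.1;
  epsilon = std::max (epsilonMin, epsilon * epsilonDecay);

  return routeId;
}
```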
Following the path installation, all packets queued in the source node's buffer (m_queue) for destination D are forwarded along the chosen path. At the same time, a feedback mechanism is set up: a (state, action (routeId), destination (D)) tuple is placed in the m_pendingFeedback buffer. This tuple is consulted once the transmission outcome (success or failure) is established, allowing the Q-learning algorithm to adjust the Q-value for the (state, action) pair and thus learn from experience and improve future route choices. This Q-learning-based approach enables the system to adaptively learn and optimize routing decisions in highly dynamic AAV networks according to current network conditions, such as buffer usage and hop distance, and to continuously refine its strategy by interacting with the environment.
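A minimal sketch of such a pending-feedback record, keyed by packet identifier, is given below; the structure, its field types, and the container are assumptions of the sketch.

```cpp
// Illustrative pending-feedback record: the routing decision awaiting a
// MAC-layer transmission outcome, keyed by packet identifier.
#include <cstdint>
#include <map>

struct PendingFeedback
{
  double hNorm;       // state components at decision time
  double avgBufNorm;
  double maxBufNorm;
  uint32_t routeId;   // action taken (selected candidate route)
  uint32_t dst;       // destination address the decision applies to
};

// packetId -> decision to reinforce once success/failure is known.
std::map<uint64_t, PendingFeedback> g_pendingFeedback;
```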
| Algorithm 2: Q-Learning-driven route selection |
- C. Phase 3—Q-value Update
The third phase of the proposed model is responsible for enhancing learning through feedback-driven updates to the Q-values controlling routing decisions. After sending data packets across the selected route, the source node (S), or the lower-level routing subsystem, awaits MAC-layer feedback on whether the forwarding to the next hop was successful. The feedback is a simple Boolean value (success = true/false), representing the outcome of the transmission attempt. Upon receiving this feedback, the algorithm proceeds to invoke the UpdateQValue function. It uses the packetId (to locate the corresponding routing decision stored in m_pendingFeedback) and the success flag as input parameters. The feedback information enables the system to retrieve the original (state, action, destination) tuple for the packet forwarded.
Once this information is retrieved, the algorithm updates the Q-value Q(s, a) according to the classical Bellman equation that underlies Q-learning: the Q-value of the last state-action pair is adjusted using the reward received and the estimated future utility. The reward (R) is +1 for a successful transmission and −1 for a failed one, rewarding good routing choices and penalizing poor ones. The update also incorporates the next state's (s′) maximum Q-value, i.e., $\max_{a'} Q(s', a')$. This term represents the maximum future reward expected from the next state and is computed over the current candidate routes for the same destination D. Through this term, the algorithm ensures that not only immediate feedback but also long-term outcomes are considered during learning. Over repeated cycles of feedback and Q-value updates, the routing algorithm progressively refines its estimate of which routes perform best under changing network conditions. This iterative learning from feedback allows the AAV network to adapt in real time to mobility, congestion, and other dynamic conditions, ultimately leading to more reliable and efficient route selection.
Figure 5 presents the main steps of the proposed procedure for updating the Q value using the Bellman equation, and Algorithm 3 provides the pseudo-code.
| Algorithm 3: Update Q value using the Bellman equation |