Aris-RPL: A Multi-Objective Reinforcement Learning Framework for Adaptive and Load-Balanced Routing in IoT Networks

Najim Halloum; Ali Ahmadi; Yousef Darmani

doi:10.3390/fi18020072

,

and

¹

Faculty of Computer Engineering, K.N. Toosi University of Technology, Tehran 163171419, Iran

²

School of Administrative Studies, Adjunct Faculty at York University, 4700 Keele St, Toronto, ON M3J1P3, Canada

^*

Author to whom correspondence should be addressed.

Future Internet2026, 18(2), 72;https://doi.org/10.3390/fi18020072
(registering DOI)

This article belongs to the Special Issue Wireless Sensor Networks and Internet of Things

Version Notes

Order Reprints

Abstract

The fast-paced utilization of innovative Internet of Things (IoT) applications emphasizes the critical role that routing protocols play in designing an efficient communication system between network nodes. In this context, the lack of adaptive routing mechanisms in the standard Routing Protocol for Low-power and Lossy Networks (RPL), such as load balancing and congestion mechanisms, especially under heavy load scenarios, causes significant degradation of network performance. In this regard, integrating innovative and effective learning abilities, such as Reinforcement Learning, into an efficient routing policy has demonstrated promising solutions for future networks. Hence, this paper introduces Aris-RPL, an adaptive routing policy for the RPL protocol. Aris-RPL utilizes a multi-objective Q-learning algorithm to learn optimal paths. Each node translates neighboring node information into a Q-value representing a composite multi-objective metric, including Buffer Utilization, Energy Level, Received Signal Strength Indicator (RSSI), Overflow Ratio, and Child Count. Furthermore, Aris-RPL operates effectively during the exploitation and exploration phases and continuously monitors the network overflow ratio during exploitation to respond to sudden changes and maintain performance. The extensive Contiki OS 3.0/COOJA simulator experiments have verified Aris-RPL efficiency. It enhanced Control Overhead, Packet Delivery Ratio (PDR), End-to-End Delay (E2E Delay), and Energy Consumption results compared to other counterparts for all scenarios on average by 39%, 25%, 7%, and 38%, respectively.

Keywords:

reinforcement learning; Q-Learning; routing; RPL; congestion; load-balancing; IoT; WSN

1. Introduction

The Internet of Things (IoT), as a rapidly expanding communication ecosystem, has found widespread applications in many domains: industrial automation [1], sustainable cities [2], healthcare monitoring [3], digital transformation, etc. Wireless sensor networks (WSNs) are the cornerstone element in the IoT paradigm by playing the crucial role of sensing, gathering, and sending data for analysis using Internet-based technologies.

A WSN consists of numerous embedded devices known as sensor nodes that are limited in energy, storage resources, and computing capabilities. Moreover, the nature of these networks, characterized by frequent changes regarding node conditions (energy consumption, buffer utilization, etc.), network congestion, and link quality and traffic demands, makes establishing a communication system challenging. Among various communication mechanisms, routing protocols are crucial and necessary, given the expansion of IoT applications [4] and the network dynamics [5]. As a result, enhancing routing in modern IoT networks is vital and needs to become a multi-objective optimization problem, including packet delivery ratio (PDR), power usage, and control overhead ratio [6]. However, the Internet Engineering Task Force (IETF), the international organization for developing Internet standards, including the core protocols that comprise the TCP/IP Internet protocol suite, has proposed a low-power and lossy network routing protocol called RPL to provide a routing solution for resource-constrained IoT nodes [7].

While the standard RPL performs adequately under low-traffic conditions, it experiences a significant deterioration in performance regarding reliability, load balancing, and energy consumption in denser network environments [8]. Specifically, neither the standard RPL nor the inefficient proposed approaches could address issues related to load balancing and energy consumption [9]. Additionally, the diversity of IoT applications, from simple temperature measurements to more complex media data transfer, makes the traffic patterns varied, leading to congestion, packet loss, and eventually node failure [10]. As a result, optimizing RPL performance in such scenarios has gained significant research interest. This led to the development of various rule-based (fixed-based) routing policies to design more effective approaches for RPL networks [11]. Many approaches have been proposed to introduce enhanced solutions for RPL, considering quality of service (QoS) standards and the physical constraints of wireless sensor networks (WSNs) [12,13]. While rule-based solutions may perform well in specific conditions, several factors limit their effectiveness in network scenarios that are more complex, such as varied traffic patterns or unbalanced loads.

Therefore, enabling sustainable intelligent communication in a resource-constrained environment is one of the network design’s essential concerns [9]. Therefore, applying learning methods to RPL as a part of an efficient communication mechanism is critical to powerful intelligent IoT applications. In this context, Reinforcement Learning demonstrated promising solutions for networking in general and routing in particular [14]. For instance, wireless networking uses Reinforcement Learning to optimize networking functions by adapting to environmental changes and making intelligent long-term decisions [15]. Recently, Reinforcement Learning has been employed to enhance the RPL routing policy to address sequential and uncertain decision-making challenges [16]. Among Reinforcement Learning techniques, the Q-learning algorithm has confirmed its applicability for packet forwarding processes [17].

Consequently, this study introduces an optimization approach for RPL called Aris-RPL, providing intelligent routing decision-making for future resource-limited IoT networks. Specifically, unlike traditional approaches that face challenges to adapt to network changes adequately or existing RL-based protocols that often lack addressing load balancing explicitly or have higher overhead in larger networks, Aris-RPL distinguishes itself by introducing a novel hybrid three-phase framework with well-selected routing metrics. That is, Aris-RPL transitions to an exploitation phase probabilistically, suppresses DIO control packets to conserve energy, and during this suppression introduces a dedicated topology monitoring phase that continuously observes the Overflow Ratio. This allows the network to reactively trigger updates only when immediate congestion is detected. To achieve this, Aris-RPL leverages the dynamic decision-making capabilities of the Q-learning algorithm. Overall, the main contributions of this paper are as follows:

Introducing a Q-learning-based approach with a novel three-phase adaptive routing framework, including exploration, exploitation, and monitoring to effectively improve RPL data forwarding performance.
Utilizing well-selected routing metrics, including Buffer Utilization, Energy Level, Received Signal Strength Indicator (RSSI), Overflow Ratio, and Child Count, to provide nodes with the ability to observe the neighborhood.
Applying a hybrid topology monitoring mechanism that suppresses DIOs during stability and monitors overflow ratio to reactively trigger updates, balancing overhead and responsiveness.

The rest of the paper is structured as follows: Section 2 provides an overview of RPL and Reinforcement Learning principles. Section 3 highlights related works. Section 4 details Aris-RPLs methodology. Section 5 discusses the evaluation experiment results. Finally, Section 6 concludes the paper.

2. Preliminaries

This section highlights the fundamental concepts of the RPL routing protocol, machine learning in general, and the Q-learning algorithm in particular.

2.1. RPL Protocol Overview

IETF introduced RPL to standardize routing in the resource-restricted Low-power and Lossy Networks. RPL is a distance vector routing protocol. It works on top of IEEE 802.15.4 PHY and MAC layers in collection-based environments, allowing network nodes to regularly transmit environmental data to a central collection point [18]. RPL builds the tree-like Destination-Oriented Directed Acyclic Graph (DODAG). It is a Directed Acyclic Graph (DAG) with a singular node called the border router or the root node that provides a gateway to the Internet. The nodes have to specify routes inside DODAD to deliver data to the root in a one-hop or multi-hop manner.

According to Figure 1, RPL uses the control messages to build DODAG, including DODAG Information Object (DIO), DODAG Information Solicitation (DIS), Destination Advertisement Object (DAO), and DAO Acknowledgment (DAO-ACK) [18]. DIO is responsible for transmitting network information between IoT nodes. Network nodes utilize DIO and DIS messages to establish upward traffic routes, while DAO and DAO-Ack messages are used to support downward traffic routes.

Figure 1. Network Graph: (a) Before Using RPL; (b) After Using RPL.

The DODAG topology highly depends on the RPL objective function (OF). It is responsible for rank calculation based on routing metrics such as link quality, delay, and energy consumption, besides defining routing optimization objectives and constraints. In fact, OF controls the metrics’ conversion criteria into ranks to select and optimize routes within a DODAG [19]. Two standardized objective functions exist in the original version of RPL. Objective Function Zero (OF0) [20] and Minimum Rank with Hysteresis Objective Function (MRHOF) [21]. Depending on the employed node/link metric(s) carried in the metric container of the DODAG information object (DIO) and the objective function, every node running RPL selects its next forwarder (preferred parent).

To regulate the rate at which nodes generate control messages (especially DIOs), RPL uses a trickle algorithm, which adjusts the dissemination rate of these messages following the stability of the topology [22]. According to this timing mechanism, nodes transmit fewer control messages if the network is stable. The transmission interval is doubled each time stability is maintained, up to a maximum period. However, if instability conditions occur (e.g., link failure), the timer resets to the least value to promptly disseminate DODAG information updates. Figure 2 provides an overview of the RPL protocol.

Figure 2. RPL Description Summary.

2.2. Reinforcement Learning Overview

Machine learning is a branch of artificial intelligence science that includes methods to detect patterns and relations in data sets and provide a prediction model to predict future data [23]. Reinforcement Learning is a machine learning technique that mimics human behavior based on trial and error and learning from interactions with the environment without requiring prior knowledge [24].

Figure 3 depicts the principle of a Reinforcement Learning operation [25] where the agent at time step t is in some state, observes the environment, takes action, and receives a reward. Then, it moves to a new state. In the case of the routing process of sensor networks, the sensor node represents an agent that acquires knowledge about the unknown environment to make decisions. Q-learning is a widely recognized off-policy reinforcement learning algorithm. Q-learning depends on the Q-table, which stores Q-values representing the expected benefit of taking a specific action in a particular state.

These Q-values are estimated and updated according to the state-action value function in Equation (1) [14].

Q_{new} (s_{t}, a_{t}) = (1 - α) \times Q_{old} (s_{t}, a_{t}) + α \times [r_{t} + γ \times (Max (Q (s_{t + 1}, a_{t + 1})))]

(1)

where

Q_{old}

is the previous value of Q-value,

α

is the learning factor, and

r_{t}

is the reward for selecting an action

a_{t}

in state

s_{t}

. State

s_{t + 1}

indicates the next state, and

γ

is the discount factor.

One of the fundamental aspects of Reinforcement Learning is understanding the algorithm’s hyperparameters and the exploitation-exploration trade-off. However, the discount factor and the learning rate in Equation (1) are essential in optimizing the learning process. The learning rate

α

controls the influence of new information on existing knowledge. On the other hand,

γ

represents the discount factor (0 ≤

γ

≤ 1). It determines how much future rewards are valued [26].

In addition, the exploitation-exploration trade-off is an essential consideration in Reinforcement Learning. The agent must balance exploiting cumulative experience to make the best decisions based on available data and exploring new actions to acquire additional data. More exploration can lead to negative rewards, while too much exploitation can result in suboptimal long-term rewards by falling into local optima [14]. Various algorithms, such as the epsilon-greedy and SoftMax techniques, effectively manage this exploration-exploitation trade-off.

Figure 3. Reinforcement Learning Basic Principle.

3. Related Work

Much research has been conducted by the scientific community to address the inherent limitations of the standard RPL protocol in dynamic IoT environments. Based on the literature review, these approaches can be categorized into three main methodologies: rule-based optimization, fuzzy logic-based optimization, and reinforcement learning-based optimization.

In the context of rule-based approaches, several studies have focused on designing advanced objective functions (OFs) to enhance routing performance in RPL-based networks. For instance, authors in [27] proposed the Queue Utilization-based RPL (QU-RPL). It used the Queue Utilization (QU) routing metric to indicate a node’s congestion level. Furthermore, QU-RPL integrates QU with Hop Count (HC) and Expected Transmission Count (ETX) to design an objective function to reduce congestion and balance the load across the network. However, not considering other metrics like energy consumption restricted its applicability to dynamic network changes. Similarly, reference [28] introduces a new metric named CER to develop the objective function, taking into account various routing metrics such as ETX, Queue Utilization (QU), node lifetime, latency, packet latency, and bottleneck node number routing, but it was limited by not handling dynamics in larger networks. The authors in [29] introduced an objective function named Brad-OF, which incorporates a group of routing metrics, including ETX, remaining energy, and delay. Additionally, the research efforts here introduced a traffic intensity metric to evaluate node buffer congestion, aiming at congestion detection and avoidance, but did not adequately address scalability in large networks. Authors in [30] proposed a method named L2RMR to improve QoS and balance the graph and load. The proposed objective function evaluated a composite metric based on the parent conditions, number of children, and parent rank. Nodes then selected parents with the support of congestion control and load balancing, yet it could not handle network dynamics effectively. The elaborated cross-layer objective function to achieve energy efficiency (ELITE) [31] aims to enhance energy efficiency to co-optimize the MAC and the RPL network layers, but its performance was constrained under high traffic loads with high end-to-end delay. The WRF-RPL protocol [32] was introduced as an enhancement to RPL, specifically targeting performance improvements in congested scenarios. WRF-RPLs objective function evaluated node conditions based on metrics carried by DIO messages. These metrics are congestion rates, hop count, and remaining energy. Then, it determined node priorities and eventually selected the preferred nodes. However, it lacked metrics for link quality. The L-RPL protocol [33] was proposed to enhance communication and routing quality within RPL-based networks. The main mechanism of the L-RPLs objective function is to consider ETX and link estimation in path selection. L-RPL enhanced packet delivery ratio and energy management. However, it did not incorporate congestion-related metrics and experienced high overhead. ELBRP [34] modified the standard RPLs objective function to improve the quality of service, load-balancing, and power usage between nodes. ELBRP made routing decisions based on data traffic distribution. It prioritizes routing quality under low traffic conditions and uses multi-metric parent selections to maintain load balance under heavy traffic. However, it suffered from challenges regarding high overhead and handling dynamic node conditions.

Fuzzy logic methods have been applied to balance different metrics in RPL routing. A notable method named the Fuzzy Logic-based objective function (OF-FL) [35] incorporated link quality, E2E delay, Hop Count (HC), link quality, and energy consumption to design an effective objective function using fuzzy rules. However, its reliance on predefined rules limited its adaptability to changing network conditions. The authors in [36] proposed an opportunistic fuzzy logic-based objective function named OPP-FL. Here, authors measured parent load based on its child count and optimized reliability by integrating the neighbor load metric, the ETX, and the Hop Count (HC) with fuzzy logic, but it struggled with scalability in dense network environments and handling network dynamics.

With the emergence of machine learning techniques, recent literature in the field of LLNs has shifted from traditional rule-based routing frameworks toward more adaptive and intelligent frameworks. In this regard, there is a growing interest in applying reinforcement learning (RL) methods to enhance network performance in wireless environments [37,38,39]. For example, authors in [40] proposed an objective function that employs the learning automata algorithm, which is an RL method, to evaluate the quality of candidate parents based on metrics such as ETX and packet loss ratio. However, it did not address load balancing explicitly. Authors in [9] applied a multi-armed bandit algorithm, which is an RL method, in low-power and lossy IoT devices. The authors claimed that the multi-armed bandit (MAB) algorithm is more suitable than the Q-learning algorithm. The study proposed a network metric, EEX, that represents a new definition of the ETX metric regarding energy, for constructing a DODAG to enhance the network performance. Network scenarios with variable traffic rates were designed to evaluate the approach. Simulation results show that the RPL improves PDR and the control overhead ratio. The Q-Learning-based intelligent Collision Probability Learning Algorithm (iCPLA) [26] used a cross-layer optimization approach by allowing the RPL protocol in the network layer to access information from other layers, such as MAC contention metrics (collision probability), to enhance the RPL process. However, it experienced increased computational complexity as the network grew. Moreover, the work presented in [41] introduced a Reinforcement Learning-based routing approach with a modified Trickle timer strategy designed to enhance RPL network performance by alleviating congestion. This approach used the Q-Learning algorithm to build the Q-table using a value function incorporating metrics such as buffer utilization, Hop Count (HC), and ETX to update Q-values. Simulation results demonstrated improved packet delivery, control overhead, and average delay. However, the approach primarily focused on congestion control without integrating a broader range of network conditions, like energy consumption. The study presented in [42] introduced RI-RPL, a multi-stage and Q-learning-based extension of the RPL protocol aimed at enhancing adaptivity to network conditions and improving RPL performance. The Q-Learning algorithm built the Q-table using a reward function incorporating Transmission Success Probability, Residual energy, buffer utilization, Hop Count (HC), and ETX metrics. Simulation results demonstrated improved PDR, end-to-end delay, energy consumption, and network throughput. However, RI-RPL showed high overhead in dense network scenarios and did not consider load-balancing situations.

Besides the previous reinforcement learning, an emerging research direction has started applying advanced Deep Reinforcement Learning (DRL) techniques to address complex IoT routing challenges and showing potential future solutions to improve Aris-RPL to better handle network dynamics and cooperation between network nodes. For instance, the work in [43] proposed a hybrid architecture combining DRL with Graph Neural Networks (GNN) to better adapt to dynamic network topologies and enhance DRL decision-making. This work can be used in edge- or cloud-based scenarios because of the high computational cost. Most recently, Authors in [44] presented a Federated Double Deep Q-Learning (DDQN) approach, enabling sensors to collaboratively train routing models for latency and energy optimization without sharing raw data and with increased synchronization overhead. Recent works such as [45] demonstrate the efficacy of reinforcement learning in handling distributed resource management challenges found in WSNs. The authors proposed a distributed deep reinforcement learning framework that models the interaction between Wireless Devices and Roadside Units in Cooperative Intelligent Transportation Systems (C-ITS) as a Stackelberg game to address limited battery energy and computing power, validating their approach through numerical simulations. However, their work focused on resource allocation and offloading layers rather than the network layer of LLNs. Additionally, the research work in [46] proposed a distributed routing algorithm for multi-hop LoRa networks based on Multi-Armed Bandit (MAB) learning. The approach achieved a reduction in energy consumption and an improvement in PDR. While effective in its domain, it mainly focused on message-based LoRa communication, which fundamentally differs from the IEEE 802.15.4 standard framework utilized by Aris-RPL. Moreover, the researchers in [47] proposed a Multi-Armed Bandit Bayesian-based algorithm for WSN environmental. The protocol achieved throughput, scalability, and control overhead enhancements.. However, it relied on a centralized SDN-based controller for managing routing tables and calculating rewards, which contrasts with the fully distributed architecture utilized by our proposed solution.

Building upon these existing approaches, our proposed Aris-RPL integrates multiple key metrics into a Q-Learning framework, offering a more efficient solution in dynamic and larger IoT networks. Specifically, Aris-RPL uses a novel composite metric that uniquely combines a proactive congestion metric (buffer utilization) with a reactive one (overflow ratio), effectively and explicitly managing load balancing (child count). This combination is not found, for example, in references [41,42]. Second, and most critically, it introduces a unique three-phase framework (explore, exploit, monitor). In this regard, our method during the monitoring phase actively suppresses DIO control packets during exploitation to reduce overhead, while simultaneously monitoring the overflow ratio (OFR) to trigger updates only when congestion is detected reactively. This hybrid strategy for balancing overhead and responsiveness is the primary novelty of our work. Table 1 provides a comparative summary of these related works, highlighting their methodologies, metrics, contributions, and limitations.

Table 1. Summary of Related Works.

4. Methodology of the Proposed Routing Mechanism

Relying on a single routing metric or inappropriate composite metrics in designing routing criteria for IoT wireless sensor networks is insufficient. As shown in Figure 4, when multiple sensor nodes transmit packets to a single intermediate node (N3), the incoming packet rate increases, and due to the MAC layer contention, processing time, and resource limitations, this leads to overflow and packet loss. The problem is worsened by varying traffic patterns. Even if the incoming and outgoing node traffic rates are equal, the limited buffer size of WSN nodes, typically 4 to 10 packets [41], significantly impacts the packet loss rate. In addition, when a parent node has multiple child nodes, it can become overloaded and more prone to failure [10].

Therefore, these factors tremendously affect the overall packet delivery ratio, especially in dense networks. As a result, the protocol design must take into consideration routing metrics that reflect the actual load condition between nodes as a decision metric to solve the load-balancing problem in uneven traffic load patterns. The original RPL versions (OF0 and MRHOF) do not consider these aspects. In that case, parent switches occur frequently. According to RFC 6550 [18,48], a parent switch is regarded as an inconsistency (or instability) in the network. This is why the RPL resets the trickle algorithm timer to transmit the control packets frequently in the DODAG and recover stability. Authors in [49] showed how selecting an unsuitable next-hop forwarder affects the frequency of parent switches in the network. The study concluded that a higher frequency of inappropriate parent selection leads to more frequent parent switches. Consequently, this increases the energy usage and control overhead in the nodes. Reinforcement learning techniques can better adjust RPL in such conditions by considering effective load-balancing and congestion-aware routing metrics to integrate a self-learning and self-adaptive routing model into the RPL routing protocol.

Figure 4. Example of Load Balancing Problem in Sensor Networks.

Accordingly, this paper introduces a load-balanced, energy-efficient, and congestion-aware routing mechanism called Aris-RPL for RPL-based IoT networks, leveraging Q-learning to enhance routing. The proposed approach operates in three main phases: (1) After network initialization, the exploration phase explores potential actions for network status. (2) The exploitation phase exploits the cumulative experience to make optimal routing decisions. (3) The network topology monitoring and management phase monitors network topology to respond to rapid network changes, ensuring performance and stability. Figure 5 shows the sequence of the approach’s three phases. The following sections will present further details.

Figure 5. Proposed General Execution Sequence.

4.1. System Model and Assumptions

Aris-RPL models the routing optimization problem as a Markov Decision Process (MDP), which provides a general framework for defining states, actions, and rewards. The MDP is characterized by the tuple (E, S, A, R), and its solution in this study is approached using the Q-Learning algorithm by finding the optimal routing policy. Here is a detailed description of the MDP components and how they integrate with the Q-Learning algorithm:

State Space $S$ : is a set of finite states in the Environment $E$ that represents the WSN network. Every node is defined as a current state in which the agent takes an action. Here, Q-Learning uses the current state to identify potential actions and evaluate their expected rewards. The next node becomes the new state. The state space is defined as $S = {s_{1}, s_{2}, s_{3}, \dots, s_{n}}$ .
Action Space $A$ : when the agent resides in the state $s_{i}$ , it observes the environment, takes an action $a$ , and then the current state $s_{i}$ moves to the next state $s_{i + 1}$ . An agent (the node) selects one of its neighbors as the next forwarder. Therefore, the current node’s neighbor set $NB$ represents the action set as $A = {a_{1}, a_{2}, a_{3}, \dots, a_{k}}$ where $k$ represents the node’s neighbor number. Q-Learning leverages this action space to update Q-values for each possible action in the current state and refine its decision-making process.
Reward $R$ is calculated by the agent upon receiving a signal from a state in the environment based on the employed node/link metrics. According to (1), the Q-learning agent will learn all available actions in each state and update the required Q-values using the reward signals. Finally, it can choose the best action with the maximum cumulative reward. More details about the reward will be in the reward function section.

It is worth noting that Aris-RPL employs Q-learning, which is a model-free reinforcement learning algorithm, meaning that there is no need for an explicit, predefined model of the environment’s state transition probabilities

P (s_{t + 1} | s_{t}, a)

or reward distribution. More specifically, the agent learns the optimal policy directly by iteratively interacting with and sampling the complex and dynamic WSN environment, updating its Q-values based on observed rewards.

The proposed design considers the following assumptions in network design:

Distributed Network: The network is decentralized, without any central controller.
Adherence to RPL Standards: Aris-RPL follows the RPL standards. All nodes (excluding the root node) are homogeneous regarding resource constraints such as power and radio range specifications.
Symmetrical and Bidirectional Interactions: Communication between nodes is bidirectional and symmetrical. Network connectivity is dynamic according to network and node condition changes.
Single/Multi-Hop Communication: The sink receives data packets from network nodes via single or multi-hop paths, depending on the nodes’ position.

4.2. Routing Metrics Used in Aris-RPL

Aris-RPL uses five routing metrics reflecting the status of the links and neighbors in the network. These metrics are Received Signal Strength Indicator (RSSI), Buffer Utilization, Overflow Ratio, Child Count, and Reminder Energy.

4.2.1. Received Signal Strength Indicator (RSSI)

The first metric is RSSI. This metric determines the strength of the received signal on the receiving side. Wireless signal measurements, such as indoor localization systems, use this metric widely. However, RSSI appears as an excellent candidate to ensure high Packet Reception Rate (PRR) results due to the high dependency between RSSI and PRR [50,51]. RSSI values are typically measured in -dBm, where a more negative value indicates a weaker signal. The node reception unit repeatedly calculates the neighbor’s RSSI value.

4.2.2. Buffer Utilization (BU)

It measures how much of the buffer’s capacity is currently being used. This metric helps monitor the buffer status and predict potential congestion. Therefore, it helps to make adjustments to prevent overflow as much as possible. The following equation calculates the free space of the node’s buffer:

B U_{i} = \frac{N P_{i}}{B S_{i}}

(2)

where

{NP}_{i}

is the number of packets currently in node i’s buffer,

{BS}_{i}

is the total buffer size of node i. As a result, the buffer utilization typically ranges from 0 (entirely free capacity) to 1 (occupied capacity).

4.2.3. Overflow Ratio (OFR)

In our proposed approach, the overflow ratio (OFR) [16] assesses the congestion status of a node’s buffer by measuring the ratio of lost packets to the total number of packets handled by the node under heavy traffic. It represents a reactive measure to mitigate network congestion by identifying nodes that are dropping packets already and adjusting the load accordingly. To effectively calculate the OFR metric and capture the dynamic nature of network congestion, we need to measure it within specific time intervals. This interval will be set to the initial assignment of the trickle algorithm’s transmission interval t. To provide a balanced view of the node’s OFR, it is smoothed by the exponentially weighted moving average (EWMA) filter as follows:

{OFR}_{i} = \frac{{NPL}_{i}}{{NPR}_{i} + {NPS}_{i}}

(3)

{OFR}_{i} = (σ \times {OFR}_{iold}) + (1 - σ) \times {OFR}_{inew}

(4)

where

{NPL}_{i}

,

{NPR}_{i}

, and

{NPS}_{i}

represent the number of nodes i’s overflowed packets (lost), the number of child packets received by node i, and the number of nodes i’s generated packets, respectively. The parameter σ represents the smoothing factor, where σ ∈ [0, 1], and its value is 0.7. Using OFR and BU metrics together for managing congestion and load-balancing situations offers a beneficial approach. When used integrally, they enable an effective assessment of the network’s congestion state, enabling proactive and reactive measures. This perspective ensures more effective load balancing and congestion management.

4.2.4. Reminder Energy (RE)

It expresses the nodes’ reminder energy and is calculated as follows:

0 \leq {RE}_{i} = \frac{{Res}_{i}}{{Init}_{i}} \leq 1

(5)

Here

{R e s}_{i}

and

{I n i t}_{i}

represent the residual energy and the initial energy of node i, respectively.

4.2.5. Child Count

Here, the child count metric contributes to the parent selection process. It is a constraint metric utilized with other conditions to calculate the final reward used to find the related Q-value of the neighbors. The value of this metric is assigned empirically and depends on the type of application (more details in the reward function section).

4.3. The Q-Learning Algorithm

The Q-table is organized into rows and columns, where each row corresponds to a state (i.e., a node), and each column represents the possible actions (i.e., selecting other nodes as the next state). Consequently, the Q-table size is the number of nodes multiplied by the number of actions. However, since nodes in WSNs are resource-constrained, storing the entire Q-table is impractical, especially in dense networks. Therefore, Aris-RPL utilizes only a single row of the Q-table per node. Instead of maintaining the complete Q-table, each node stores a one-dimensional array sized according to its neighbors.

4.3.1. The Reward Function

Integrating the routing metrics mentioned above into the reward function requires specific preliminary steps. The proposed approach aims to provide an adaptive load-balancing and congestion solution while maintaining network resources. Consequently, routing metrics such as node buffer utilization and overflow ratio significantly affect the reward value and the neighbor Q-value. Therefore, high buffer utilization and overflow ratio levels will negatively impact the reward calculation. Thus, the following equations calculate these metrics:

{QFR}_{final} = \{\begin{matrix} {OFR}_{j} i f {OFR}_{j} \leq ε_{thr} \\ l n (1 - {OFR}_{j}) O t h e r w i s e \end{matrix}

(6)

{BU}_{final} = \{\begin{matrix} {BU}_{j} i f {BU}_{j} \leq δ_{thr} \\ l n (1 - {BU}_{j}) O t h e r w i s e \end{matrix}

(7)

where

ε_{thr} and δ_{thr}

are design parameters. They represent the thresholds after which the parent node is considered to experience high Overflow Ratio (OFR) and Buffer Utilization (BU) levels, respectively. Additionally, the RSSI metric must be normalized by adjusting its value to a standard scale so it can be combined linearly with the rest of the metrics. The proposed approach uses min-max normalization for this purpose. Consequently, to address the multi-objective nature of the routing requirement, neighboring nodes are rewarded based on their link/node conditions according to

{Reward}_{i, j} = {QFR}_{final} + {BU}_{final} + {RE}_{j} + \frac{RSS I_{i, j} - RSS I_{\min}}{RSS I_{\max} - RSS I_{\min}}

(8)

where

{Reward}_{i, j}

is the obtained reward when a node

i

selects node

j

as a preferred parent. Here, the reward function is designed intentionally to have a linear combination in Equation (8) for computational simplicity and low overhead, which is essential for resource-constrained nodes. Moreover, using a logarithmic scaling in Equations (6) and (7) serves as a critical safety constraint that only activates when a node approaches a failure state. This is an intended design choice to prioritize network reliability, congestion, and load balancing management. Specifically, even with the equal weighting in Equation (8), the weighting principles are applied implicitly through the interplay between linear Equation (8) and non-linear logarithmic penalties in Equations (6) and (7), allowing the protocol to prioritize link quality under normal conditions while automatically penalizing when a node approaches the predefined thresholds. For example, as the buffer ratio approaches its maximum value (e.g., BU to 1), the

\ln (1 - BU)

penalizes aggressively, providing a powerful, non-linear negative reward that discourages the selection of nodes nearing congestion. Moreover, using EWMA smoothing in Equation (4) ensures that while the system is highly sensitive to sustained congestion, it remains resilient to momentary spikes, which is a vital component of the approach’s implicit weighting strategy.

Now, depending on the status of the neighboring nodes in the network, additional rewards and penalties are assigned, which will be explained in detail. Aris-RPL considers

maximum reward

,

e x t r a r e w a r d

, and

p e n a l t y

values to handle various situations. This dynamism is essential to achieving the proposed approach’s goals. Specifically,

{R e w a r d}_{f i n a l}

will equal the

maximum reward

if the next step node

(N N)

is the root. The

extra reward

is applied if the next node is one hop from the root to encourage data packet forwarding toward the root.

Penalty

is used to avoid routing loops, sending data downward, and nodes that exceed the child counter threshold

(CC)

. The

CC

threshold value is determined based on the application and network conditions. A smaller

CC

value results in a more balanced DODAG graph. However, if the

CC

value is too small, it can negatively affect parent selection and QoS, as nodes might not be able to find an appropriate parent. In our simulation, we set the CC threshold to 5. This mechanism helps address the thundering herd problem by balancing the load, even when a new node with better conditions joins the network. Next, the calculated rank from Equation (8) is applied otherwise. Consequently, Equation (9) calculates the final reward. Then, according to Equation (1), the q-value is calculated. Furthermore, the maximum neighbor nodes’ q-value is appended to the DIO metric container to be sent and used by other nodes.

{Reward}_{final} = \{\begin{matrix} \begin{matrix} R_{\max} i f N N i s t h e r o o t \\ R + E x t r a i f N N i s o n e- h o p f r o m r o o t \\ P e n a l t y i f {NN}_{rank} > {Node}_{rank} o r C C i s m a x i m u m \\ R o t h e r w i s e \end{matrix} \end{matrix}\}

(9)

Finally, the node’s best parent is determined to be the neighboring node with the highest q-value.

4.3.2. Q-Learning Value Function

As mentioned earlier, the routing policy of the proposed approach works in three phases. This function builds the Q-table, and as explained in Section 2, the entry values in the Q-table, known as Q-values, are calculated and updated based on Equation (1). These Q-values reflect the penalties and benefits of taking a particular candidate parent node based on reinforcement signals received from neighboring nodes. In the proposed approach, DIO messages contain these signals, which are disseminated periodically according to the Trickle algorithm. The information needed to update the Q-values, including the maximum Q-value of the neighbor and the routing metrics, is appended in the DIO message. Therefore, the modified body of the DIO message includes RE, OFR, BU, and the maximum Q-value of the neighbor. Consequently, when receiving a DIO message, the node updates its Q-table according to appended information. It is worth noting that by propagating and using the maximum Q-value in Equation (1) with a discount factor of 0.7, Aris-RPL considers aggregated rewards beyond the one-hop neighborhood indirectly. Additionally, at the beginning of the network start-up, all nodes initialize the new entries of the Q-table for the first time, receiving DIO messages with:

Q_{i} ({neighbor}_{j}) = \frac{1}{1 + {rank}_{j}}

(10)

where

{rank}_{i}

represents the hop count of node

j

. As DIO messages are exchanged and time progresses, the q-values converge to their desired values. Algorithm 1 explains the Q-learning algorithm function. Upon receiving a DIO message from a neighbor node, the node first sets the learning rate, discount factor, and approach hyperparameters such as thresholds and normalization values, etc. (lines 1–3). The node extracts the routing metrics from the DIO message (line 4). Then, the algorithm normalizes the calculated RSSI and calculates OFR and BU according to (6, 7) (lines 7–9). Eventually, it calculates the reward based on the normalized metrics (line 10). Then, to update the Q-table, the node considers two conditions. When an unknown node sends a DIO message, this new neighbor is initialized and added to both tables of Q-values and the parent set, with a value assigned to it (lines 11–15). If the neighbor node is known, its Q-value is updated based on the current node and network status (lines 16–29). Finally, the node updates its routing metrics and identifies the highest Q-value among neighbors. This information is then embedded in the DIO message. DIO contains all the needed information about network conditions (lines 30–31). Network nodes execute the Q-function whenever they receive a DIO message. Consequently, nodes can select one of their neighbors as the next forwarder based on the selection function of the approach’s next phase.

Algorithm 1 The Q-learning Algorithm
	Input: Incoming DIO message from Neighboring j
	Output: Outgoing DIO message with Updated metrics, Update the Q-Table
1	Define: D_Factor $γ$ , L_Rate $α$ , Buffer Utilization threshold $δ_{t h r}$ ,
2	Overflow ratio Utilization threshold $ε_{t h r},$ RSSI_max, RSSI_min,
3	Max Child Count (CC), Extra, Penalty, R_max;
4	Define: Routing Metrics (BU, OFR, RE, CC);
5	Neighboring_Node N_j, Receiving_Node N_i
6	Begin:
7	RSSI_j ← Calculate RSSI_j;
8	Calculate OFR, BU according to (6) and (7);
9	Normalize RSSI_j;
10		Reward_j ← Calculate Reward (Routing Metrics);
11		if N_j is new then
12			Parent-Set[j].Parent_ID ← DIOj.Node_ID;
13			Q-Table[j].Parent_ID ← Parent-Set[j].Parent_ID;
14			Q-Table[j].Parent_qvalue ← QValue_Initial;
15		end
16		if N_j is not new then
17			if N_j is the Netwotk_root then
18			Q-Table[j].Parent_qvalue is updated with (R_max);
19			end
20			else if N_j is a root’s child then
21			Q-Table[j].Parent_qvalue is updated with Reward_j + Extra;
22			end
23			else if N_j.Rank >N_i.Rank or CC is maximum then
24			Q-Table[j].Parent_qvalue is updated with Penalty;
25			end
26			else
27			Q-Table[j].Parent_qvalue is updated with Reward_j;
28			end
29		end
30 31		/Outgoing DIO message contains Updated routing metrics, and the max neighboring Q-value;
32	exit:

4.4. Selection Algorithm

The first and second phases of the proposed approach are executed by this function, which represents the vital component of the new RPL protocol’s objective function. This function is invoked whenever a routing decision needs to be made. As previously discussed, a key consideration in reinforcement learning algorithms is balancing the exploitation and exploration phases.

A practical action selection strategy must be adopted to avoid falling into a locally optimal solution. During the exploration phase, the agent will explore all possible actions over several episodes and choose actions randomly. Aris-RPL uses the SoftMax selection strategy to balance exploration and exploitation. The SoftMax method offers a probabilistic approach that selects the actions based on their relative Q-values. It means that the value of the parent reward affects the probability of choosing the parent. The selection probability for an action

a_{i}

is given by:

{PR}_{i} (j) = \frac{e^{\frac{Q_{i} (j)}{θ}}}{\sum_{k \in N (i)} e^{\frac{Q_{i} (k)}{θ}}}

(11)

where

{PR}_{i} (j)

is the probability of selecting a node

j

by node

i

as a parent.

Q_{i} (j)

is the Q-value of neighbor node j.

θ

is an exploration parameter.

N (i)

represents the node i’s neighbor set. In the SoftMax selection strategy, when

θ

is large the action selection probabilities become more uniform, encouraging exploration. Conversely, when

θ

is small, the probabilities become more biased towards actions with higher Q-values, favoring exploitation. A common approach is to start with a higher

θ

to promote exploration and gradually decrease it over time to shift toward exploitation. This can be achieved using a decay schedule according to Equation (12). In our proposed approach, exploration is performed according to the number of time steps or episodes, where

episode

≤

{episode}_{\max}

. During exploration, Aris-RPL initiates

θ_{0}

with a large value to make selection probabilities more uniform. Then,

θ

is calculated according to Equation (12). Next, Equation (11) calculates the selection probability of nodes regarding their Q-values. Over time,

θ

decreases with the increase of parameter t:

θ = {θ_{0} \cdot e}^{(- μ t)}

(12)

where

θ_{0}

represents the initial

θ

value,

μ

is the decay rate, and

t

is the time step according to the

episode

variable value (for example, when episode = 3, then t = 3). The algorithm transitions to the exploitation phase when the

episode

variable reaches where

θ_{0}

represents the initial

θ

value,

μ

is the decay rate, and

t

is the time step according to the

episode

variable value (for example, when episode = 3, then t = 3). The algorithm transitions to the exploitation phase when the

episode

variable reaches the maximum value

{episode}_{\max}

. At this point, the algorithm chooses a parent based on the highest Q-value of all candidates.

Algorithm 2 outlines the selection algorithm used in the approach. Firstly, the algorithm defines the parameters (the initial value

θ_{0}

, decay rate

μ

, and

{episode}_{\max}

) (line 1). If the parent set is empty, the node explores its vicinity and broadcasts a DIS message to solicit DIO and avoid disconnection from DODAG (lines 3 to 5). Next, the exploration phase begins. During this phase, the exploration value is calculated according to (12) (line 7). The preferred parent is then selected probabilistically (line 8). After each iteration, the episode count is incremented by 1 (line 9). When the episode count reaches its maximum value,

{episode}_{\max}

, the algorithm transitions to the exploitation phase, where a parent with the highest Q-value is selected (lines 11–13). Afterward, the proposed approach is ready to execute the third phase.

Algorithm 2 Subsequent Node Selection Algorithm
	Input: Data Packet from a child node
	Parent-set and Q-table of the current node
	Output: Preferred next hop node (Parent)
1	Define: Init value $θ_{0}$ $, Dcay rate μ$ , episode_Max,
2	Begin:
3		if (Parent-set is Empty) then
4				broadcast a DIS message;
5		end
6		if episode ≤ episode_Max then //Exploration
7			$Calculate θ$ according to (12);
8			Preferred-Parrent ← Select from parent-set according to PRi(j) in (11);
9			episode++;
10		end
11	if episod > episod_Max then//Exploitation
12			Pmax ← Find Qvalue_Max(Parent-set, Q-Table);
13			return P_max;
14		end
15	exit: Selected preferred parent

4.5. Network Topology Monitoring Management Function

The third phase is executed during the exploitation after choosing the preferred parent to respond to potential and rapid network changes. As explained earlier, our reinforcement learning approach updates network information through DIO messages in the exploration phase and utilizes this information in the exploitation phase. Aris-RPL halts DIO transmission in the exploitation phase by resetting the Trickle timer period. This significantly reduces overhead without affecting performance. In standard RPL, the Trickle timer’s long intervals between consecutive DIO transmissions can result in inaccurate and outdated congestion information about the node’s vicinity. Aris-RPL ensures timely and accurate network updates, addressing frequent network changes effectively as follows: During the exploitation phase, each node starts a timer

T

to monitor the OFR metric. If the OFR value exceeds the congestion threshold

ε_{0}

during timer T, the node will send a DIO message to inform its child nodes of the new update, resulting in an update to its Q-value according to Algorithm 1 and returning to exploration. If OFR does not exceed

ε_{0}

, reset timer T and repeat the process. This observation timer T is numerically defined as the initial Trickle transmission interval (I_min = 4 s) in our experiments, ensuring the monitoring phase captures congestion with the same granularity as the protocol’s most responsive state. Noteworthy, based on the previous algorithms, the node only breaks the silence in the exploitation phase when a physical buffer accumulation is accompanied by a mathematically confirmed reliability crisis, initiating either a global or local repair, meaning resetting the Trickle timer to send DIO messages and disseminate the new information repeatedly. Resetting the timer also resets the episode parameter, activating the exploration phase.

Algorithm 3 explains the related function. Firstly, the algorithm defines the parameters (the Timer T and the congestion threshold

ε_{0}

) (line 1). While the node is in the exploitation phase, it starts a timer T and calculates the OFR and other metrics every T (lines 3–5). If the OFR value exceeds the congestion threshold

ε_{0}

, the node broadcasts a DIO message to inform child nodes of the new update (lines 6–8). Finally, if a local or global repair happens, reset the Trickle timer t and episode counter (lines 10–13). It is worth mentioning that the initialization time is set to 180 s, meaning this is the required time to startup network, to transit from uniform exploration to biased exploitation, and before starting to collect environmental data. Moreover, since there is an interaction between the topology monitoring phase and the Trickle algorithm, we have defined the upper bound on recovery delay when Aris-RPL triggers an immediate DIO broadcast upon detecting congestion as UB_D as the summation of T (the monitoring timer) and I_min (the minimum Trickle interval). As a result, the maximum recovery delay for disseminating a topology update is 8 s. Figure 6 illustrates the complete workflow of the proposed framework. It demonstrates the interdependencies between the different phases: (1) Building and updating the Q-table utilizing Algorithms 1 and 2, Exploration and Exploitation phases utilizing Algorithms 2 and 3, and the Topology Monitoring phase utilizing Algorithm 3. This comprehensive workflow demonstrates how the Q-learning agent builds a Q-table, transitions from exploration to optimal parent selection, and maintains network stability through reactive monitoring.

Algorithm 3 Topology Monitoring and Management
	Input: Data Packet from a child node,
	Q-table of the current node
	Output: DIO message, possible exploration activating
1	Define: Timer $T$ , congestion threshold $ε_{0}$
2	Begin:
3		while Exploitation do
4			Start TIMER T;
5			Calculate OFR, and other metrics every T;
6			if OFR ≥ $ε_{0}$ then
7				broadcast a DIO message with the last node metrics;
8			end exit;
9		end
10		if local or global repair then
11			Reset the Trickle timer t;
12			Reset episode counter;
13		end
14	exit:

Figure 6. The Complete Workflow of Aris-RPL.

5. System Setup and Results

Aris-RPL has been evaluated using the InstantContiki OS 3.0 64-bit with the GUI Cooja simulator environment [52] running on a physical workstation with an Intel Core i7 9th CPU, a 16-bit operating system, an NVidia GeForce GTX 1650 GPU, and 16 GB of RAM. This environment is widely accepted in many IoT studies [13]. One of its strengths is the integration of the MSPSim instruction-level emulator, which provides realistic simulations with accurate timing, especially for the famous Texas Instruments MSP430 microprocessor-based hardware platforms [53], besides using CSMA (Carrier Sense Multiple Access) methods in its Medium Access Control (MAC) layer [54] that mitigates interference through mechanisms such as collision avoidance. In addition, Cooja supports various sensor networking platforms, such as the Zolertia One (Z1) and Tmote Sky.

The Z1 platform was chosen in simulation scenarios due to its accurately emulated model in Cooja, which ensures more realistic IoT scenarios and precise results. The major specifications of the Z1 platform are shown in Table 2 [55]. It is worth noting that the Z1 platform contains the CC2420 module, an IEEE 802.15.4-compliant RF transceiver developed by Texas Instruments. This module is designed for low-power and low-voltage wireless applications and is widely used in real-world implementations [56]. The study introduced in this paper has considered various topology scenarios for extensive evaluations. These scenarios were designed to reflect critical aspects the approach needs to address: density, traffic rates, and path diversity.

Evaluating these aspects is essential to demonstrating the strength of the Aris-RPL for selecting more load-balanced network paths. Accordingly, we evaluated three scenarios with different node densities: 50, 75, and 100 nodes. The network area in each scenario is 100 × 100 m² with randomly located nodes. Each scenario was evaluated with two different traffic rates: 5 packets per minute and 10 packets per minute, in which all nodes create the traffic. Moreover, each scenario was executed for 5 independent iterations, with a simulation duration of 3600 s (1 h) per run, different random setups, and 95% Confidence Intervals. Additionally, for random topology, nodes were placed using a random distribution in each of the 5 iterations, creating different layouts with bottleneck points for every run. The complete simulation settings and the list of parameter meanings of Aris-RPL are detailed in Table 3, respectively. Additionally, to demonstrate its effectiveness, the following approaches are implemented and used in evaluation scenarios:

RPL- MRHOF represents the original version of RPL. Its objective function uses the ETX link quality metric as the selection criterion of the preferred parents [21]. The comparative analysis and performance evaluation with RPL-MRHOF allow us to highlight the improvements achieved by Aris-RPL over traditional RPL.
A Learning-Based Resource Management for Low Power and Lossy IoT Networks [9], which is referred to as MAB in later sections. It is a machine learning approach that operates on distributed nodes within the Contiki/RPL framework like Aris-RPL, and utilizes the multiarmed bandit (MAB) technique to optimize performance in dynamic IoT networks. According to the MABs authors, it has demonstrated superior results against load-balancing and congestion-aware RPL enhancements such as [26,27]. Algorithms 4 and 5 explain the MAB approach used in this study, and the related simulation settings are detailed in Table 4.
A congestion-aware routing algorithm in Dynamic IoT Networks [41], which is referred to as CRD in later sections. It is a Q-learning approach for dynamic IoT networks under heavy-load traffic scenarios. CRD aligns closely with the reinforcement learning principles used in Aris-RPL. It adopts the Q-learning algorithm at each node to learn an optimal parent selection policy to tackle the load-balancing challenges in RPL networks. Algorithms 6 and 7 explain the MAB approach used in this study, and the related simulation settings are detailed in Table 5.

Algorithm 4 Multi-armed Bandit-learning-based algorithm MAB
	Input: Neighbor set N(x) for each node x that contains n neighbors.
	Output: Update the Q-Table
1	Define: L_Rate $α$ , r+, r−, Neighbor j;
2		re_limits_max = 3
3		ba-off_stages_max = 5
4	Define: CWmin = 0, CWmax = 31
5		c_reward = 0, Q_n(a) = 0, Q_n+1(a) = 0
6	Begin:
7		ETX_j ← Calculate ETX using neighbor_link_callback ();
8		EEX_j ← Calculate EEX from ETX;
9		Evaluate reward;
10		if C_EEX ≤ Pr_EEX then reward = r+
11		else reward = r−
12		For Action (a) update r_table;
13		Using the following update the Q-values table
14		$Q_{n + 1} (a) = {(1 - α)}^{n} Q 1 + \sum_{i = 1}^{n} α {(1 - α)}^{n - i} r_{i}$
15		end if
16	End
17	exit:

Algorithm 5 MAB Selection algorithm
	Input: Data Packet from a child node
	Output: Preferred next hop node (Parent)
1	Define: I = I_min, counter c = 0, Root = R, Node = n, Parent = p;
2	Begin:
3		set I = I × 2
4		if I_max ≤ I then I = I_max
5		end if
6		if exploitation then
7			Select parent y with minimum Q-value ∀ y ∈ N(x)
8			Suppress DIO transmissions
9		end if
10		if exploration then
11			if n = R then R rank = 1
12			end if
13			if p = null then rank = pathank_max
14			end if
15			if p != null then rank = h + Rank(pi) + r_increase
16			r_increase = EEX
17			end if
18			return MIN (Baserank + r_increase)
19			embed r_increase in DIO
20			t_timer = random [I/2, I]
21			if (network is stable) then counter ++
22			else
23			I = I_min
24			if (t_timer expires) then broadcast DIO
25			end if
26		end if
27	end
28	exit: Selected preferred parent

Algorithm 6 Congestion-Aware Routing using Q-Learning (CRD)
	Input: Incoming DIO message from Neighboring j
	Output: Outgoing DIO message with Updated metrics, Update the Q-Table
1	Define: BF_th, L_Rate $α$ , η;
2	Define: Routing Metrics (BF_j, HC_j);
3	Neighboring_Node N_j, Receiving_Node N_i
4	Begin:
5	ETX_j ← Calculate ETX_j;
6	Decode BF_j and HC_j from the received DIO message;
7	$λ (y) =$ max (BF_j/BF_th, 1 − BF_j/BF_th)
8		R_j ← $λ (y) \times$ BF_j + ETX_j + HC_j
9		if N_j is new then
10			Parent-Set[j].Parent_ID ← DIOj.Node_ID;
11			Q-Table[j].Parent_ID ← Parent-Set[j].Parent_ID;
12			Q-Table[j].Parent_qvalue ← QValue_Initial;
13		end
14		if N_j is not new then
15			Q-Table[j].Parentqvalue $\leftarrow$ Qold(j) + $α$ [R_j − Qold(j)];
16		end
17		BF_i	← Calculate BF_i
18		HC_i	← Calculate HC_i
19		Encode BF_i and HC_i in the Outgoing DIO message
20		/Outgoing DIO message contains updated routing metrics;
21	exit:

Algorithm 7 CRD Selection Algorithm
	Input: Data Packet from a child node
	Parent-set and Q-table of the current node
	Output: Preferred next hop node (Parent)
1	Define: exploration factor $θ$ , Consecutive q_losses $\emptyset$ , $\emptyset_{0}$ , Timer X, Imin
2	Begin:
3	if (Parent-set is Empty) then
4	broadcast a DIS message; end
5	Compute selection probabilities using:
5	$Px (y) = 1 - \frac{e^{Qx (y) / θ}}{\sum_{k \in N (y)} e^{Qx (k) / θ}}$
6	Select the preferred parent y with the highest probability Px(y)
7	Reset Trickle Timer:
8	Reset the Trickle timer to Imin if the node detects $\emptyset$ consecutive queue losses
9	Increase $\emptyset$ by $\emptyset_{0}$ after each reset to limit overhead
10	Reinitialize timer values if no losses occur within interval X
11	exit: Selected preferred parent

Table 2. Z1 Sensor Node Specifications [55].

Parameter	Value
Micro-Controller Unit (MCU)	2nd MSP430 generation
Architecture	16-bit RISC (Upgraded to 20 bits)
Radio Module	CC2420
Operating MCU Voltage Range	1.8 V < V < 3.6 V
CC2420 Voltage Range	2.1 V < V < 3.6 V
Operating Temperature	−40 °C < θ < +85 °C
Off Mode Current	0.1 µA
Radio Transmitting Mode @ 0 dBm	17.4 mA
Radio Receiving Mode Current	18.8 mA
Radio IDLE Mode Current	426 µA

Table 3. Simulation Settings.

Parameter	Value
Area	100 × 100 m²
Nodes’ Number	50, 75, 100
Sink Number	1
Radio Channel Model	UDGM Distance Loss
Com./Interference Range	30 m/40 m
Traffic Rate	5 ppm, 10 ppm
α, γ	0.4, 0.7
$ε_{thr}, δ_{thr}$	0.3, 0.7
Penalty	−3.0
R_Max, R_extra	100, 5.0
Max Child Count CC	5
$θ_{0}$ , $μ$ , $ε_{0}$	2.0, 0.2, 0.3
Episode_max	10
Initialization time	180 s
Simulation Duration	3600 s
Simulation Speed	No speed limit
Simulation Iteration Number	5 times

Table 4. MAB Main Simulation Parameters.

Parameter	Value
PHY and MAC protocol	IEEE 802.15.4 with CSMA/CA
Radio model Unit disk graph medium (UDGM)	UDGM—Radio model Unit disk graph medium
Buffer size	4 packets
UIP payload size	140 bytes
Imin	10
Imax	8 doubling
Initial reward = 0	0
Initial Q₁(a) = 0	0
Learning Rate (α)	0.6
Reward Function	+1 (improved EEX), −1 (worsened EEX)
Maximum retry limit	3
Maximum Backoff stage	5
Number of stored reward values for action a	5

Table 5. CRD Main Simulation Parameters.

Parameter	Value
PHY and MAC protocol	IEEE 802.15.4 with CSMA/CA
Radio model	UDGM
Learning rate (α)	0.3
Congestion threshold (BFth)	0.5
η the positive integer to enable decoding correctly	100
Exploration factor (θ)	2
$\emptyset$ defines the number of consecutive queue losses of a node	2
Timer X	100 ms
Imin	3 s

It is important to emphasize that for all the protocols, the Trickle Timer Settings are the same. All protocols were tested under strictly identical physical and network conditions using the same random topology seeds, and they operated over the same 6LoWPAN/IPv6 and CSMA MAC layers. Finally, the following performance metrics have been considered in the experiments: (1) Packet Delivery Ratio, (2) Control Traffic Overhead, (3) Energy Consumption, and (4) End-to-End Delay. The details of the evaluation results are discussed in the following:

5.1. Packet Delivery Ratio (PDR)

PDR reflects the routing policy reliability, especially in dense network scenarios, and represents the percentage of successfully received packets during the network lifetime. This performance metric is calculated based on the following equation.

PDR = \frac{{No . Pack}_{delivered}}{{No . Pack}_{total}} \times 100

(13)

where

{P a c k}_{d e l i v e r e d}

is the total number of packets successfully delivered to the root, and

{P a c k}_{t o t a l}

is the total number of packets transmitted during the simulation by the IoT nodes. Figure 7 demonstrates that the proposed approach outperforms its counterparts concerning PDR. For instance, in the first scenario with 50 nodes and a traffic rate of 5 ppm, the proposed approach achieved the highest PDR at 92%. Here, compared to MRHOF, MAB, and CRD, PDR was improved by 15%, 5%, and 4%, respectively.

Figure 7. PDR Results in Different Network Scenarios.

On the other hand, as network size and traffic rate increase, the PDR for all approaches declines due to rising congestion and overflow levels. However, our proposed approach consistently performed well across all scenarios compared to other methods. MRHOF, despite using a probing mechanism to measure link quality, exhibited the lowest PDR in all scenarios. This is because the original objective functions in the RPL protocol lack a mechanism to deal with load imbalance and uneven traffic loads. Both CRD and MAB showed almost similar results for all scenarios, benefiting from intelligent routing policies, with CRD having a slight distinction due to an effective learning load-balancing process. Actually, all terms (the linear terms (RSSI, Energy) and logarithmic congestion penalties (OFR and BU)) in the linear design of the reward function work together as a constrained optimization. The linear terms determine the optimal path, subject to the constraints imposed by the logarithmic penalties. This is an intentional design choice to prioritize network reliability, congestion, and load balancing management. In this case, a path is only considered optimal if it is both high-quality and safe. Moreover, reducing the number of control packets during the exploitation phase significantly lowers channel contention and the probability of collisions. According to Figure 7, the proposed method, on average, enhanced PDR for all scenarios compared to MRHOF, MAB, and CRD by 51%, 15%, and 11%, respectively.

5.2. Control Traffic Overhead

RPL protocol uses, as explained in Section 2, ICMPv6 control messages during the network lifetime to construct and maintain the network DODAG, which involves exchanging DIO, DAO, and DIS messages according to the Trickle timer mechanism. These messages are essential for maintaining network stability and responding to network dynamics and node/link metrics changes. Furthermore, the number of these messages directly impacts energy consumption, and it is a crucial metric indicating the protocol’s ability to handle network conditions such as congestion, load balancing, and unstable radio links. Equation (14) calculates the control overhead metric. Where

n, m,

and

k

are the numbers of all DIO, DAO, and DIS messages, respectively:

Control Overhead = \sum_{i = 1}^{n} {DIO}_{i} + \sum_{i = 1}^{m} {DAO}_{i} + \sum_{i = 1}^{k} {DIS}_{i}

(14)

According to the results in Figure 8, all approaches have shown an increasing trend in control overhead numbers as the number of nodes rises, which is expected. MRHOF has a higher overhead percentage than other approaches.

Figure 8. Control Overhead Results in Different Network Scenarios.

It is worth mentioning that despite the high number of control overheads, MRHOF did not achieve a significant increase in PDR, indicating that much network traffic is wasted on transmitting and receiving control overheads without effectively responding to network dynamics. The primary reason is that MRHOF cannot address load-balancing and congestion issues, leading to a highly unstable network. Surprisingly, CRD showed the highest control overhead among all approaches, especially under heavy load, as illustrated in the third scenario. CRD, based on its methodology, resets the Trickle timer more frequently to respond to sudden network changes, resulting in more overhead exchanges. However, this increased overhead led to an improvement in PDR results. On the other hand, MAB managed to reduce control overhead compared to MRHOF and CRD without degrading performance due to DIO suppression during the exploitation phase. However, the MAB routing policy did not introduce a mechanism to quickly respond to rapid network changes during the exploitation phase, affecting its PDR results in more complex scenarios. In contrast, Aris-RPL demonstrated moderate control overhead with high PDR outcomes due to the precise selection of link/node metrics considering congestion and load-balancing status, applying the DIO suppression during the topology monitoring and management phase. According to Figure 8, the proposed method, on average, enhanced the control overhead for all scenarios compared to MRHOF, MAB, and CRD by 43%, 23%, and 51%, respectively.

5.3. Energy Consumption

According to [57], the node consumes most of the energy resources in communication activities during packet transmission. Consequently, the nodes far from their parents need to utilize more energy. The transmission energy is the required energy to transmit k-bit data to a node at a distance

d

. It is calculated as follows:

E_{T} (k, d) = E_{ele} \times k + E_{amp} \times k \times d^{p}

(15)

where

E_{ele}

is the required energy to run transmitter circuits,

E_{amp}

is the required energy to transmit amplifiers, and

p

is the path loss index. The typical values of these parameters are

E_{ele} = 50 nJ / bit,

E_{amp} = 100 pJ / bit / m^{2},

and

2 \leq p \leq 4,

respectively [9]. In our proposed approach, according to [58], the total energy consumption in every node consists of four communication modes: (1) CPU mode, (2) Transmission mode, (3) Receiving mode, and (4) Low Power mode. Contiki OS uses the

Energest ()

module to estimate a node’s power consumption by providing the accumulated time the sensor node spends in various communication modes. The following formulas are used to measure the energy consumption in each state [9]:

\begin{matrix} E_& L P M = (E n e r g e s t_L P M \times 0.020 \times 3) / 32, 768 \\ E_{C} P U = (Energes t_{C} PU \times 0.426 \times 3) / 32, 768 \\ E_{T} x = (Energes t_{T} x \times 17.4 \times 3) / 32, 768 \\ & E_R x = (E n e r g e s t_R x \times 18.8 \times 3) / 32, 768 \end{matrix}

(16)

The

Energest ()

model in Contiki OS has the

e n e r g e s f l u s h ()

function that provides the value of

{energest}_{L} PM

,

{energest}_{C} PU

,

{energest}_{T} x

, and

{energest}_{R} x

variables in (16), respectively. The numbers in these equations are according to the Z1 mote standard specifications [59]. For instance, 17.4 mA is the current consumption required to run the transmission unit in the CC2420 module of Zolertia 1 mote, which consumes 3 V, and 32,768 is the tick-per-second value of the Z1 mote. The total energy consumption for each node is calculated by summing the energy consumed according to (16) during the simulation. To analyze the results shown in Figure 9, we observe that the control overhead and PDR results directly impact the nodes’ total energy consumption, as communication is the primary energy-consuming activity.

Figure 9. Total Energy Consumption Results in Different Network Scenarios.

Generally, energy consumption increases with the number of successfully delivered data packets and the transmitted control overhead. For instance, when the network becomes denser and congested, leading to load-balancing issues and instability, resulting in more overhead transmissions. Therefore, effectively handling these concerns is necessary.

In all scenarios, MRHOF and CRD consumed the highest amount of energy. MRHOFs high energy consumption is due to its probing mechanism and inability to address congestion and load balancing. CRDs high energy consumption results from frequent Trickle timer resets and PDR results. In comparison, MAB reduced the total energy consumption but was not as effective as the Aris-RPL because MAB relies on the ETX metric in its routing criteria, which consumes energy while probing.

Aris-RPL not only intelligently integrates well-selected link/node metrics (including energy metrics) to select the subsequent forwarder nodes but also quickly responds to network changes without resetting the Trickle timer. This is important since in low-power wireless protocols, energy consumption is dominated by the radio’s startup, which occurs regardless of packet size. As a result, reducing the frequency of control overhead during the exploitation phase and not utilizing the ETX metric leads to a higher Packet delivery ratio and lower control overhead, meaning that most energy consumption is used during data packet transmissions. According to Figure 9, the proposed method, on average, enhanced energy consumption for all scenarios compared to MRHOF, MAB, and CRD by 42%, 22%, and 50%, respectively.

5.4. E2E Delay

End-to-end delay represents the time a packet takes from the source to reach the final destination (the root) [16]. The total delay is the summation of differences between the sending and receiving times for all data packets being successfully delivered. The average delay is obtained by dividing the total delay by the total number of received packets. The following equations calculate the delay:

Total Delay = \sum_{k = 1}^{n} ({Receive}_{time} (k) - {Sent}_{time} (k))

(17)

Average Delay = \frac{Total Delay}{Total Packets Received}

(18)

According to Figure 10, as network density and traffic load increase, the average delay for all approaches rises due to increased packet buffering and queuing time. In the first scenario, MRHOF showed a higher delay due to its continuous probing mechanism for good links and improper congestion and load balancing mechanisms. In contrast, MAB could decrease MRHOF’s delay through the intelligent learning method. However, the Aris-RPL and CRD showed similar end-to-end (E2E) delays, with the proposed method having a slightly lower delay. This trend continues in the second scenario, where the proposed method consistently demonstrates the lowest average E2E delay compared to the other approaches, and according to the second scenario results in Figure 10, the proposed method, on average, enhanced the delay compared to MRHOF, MAB, and CRD by 4%, 6%, and 6%, respectively.

Figure 10. E2E Delay Results in Different Network Scenarios.

However, the proposed method started showing a higher E2E delay in the second scenario (75 nodes with 10 ppm) and continued in the third scenario. As previously explained, Aris-RPL is functionally designed to prioritize network reliability, congestion management, and load balancing. As a result, this is an intentional architectural trade-off designed to maximize the reliability in high-traffic scenarios. More specifically, by penalizing congested parents, the selection algorithm chooses nodes with additional hops, providing more stable routes at the expense of a marginal increase in delay. This observation aligns with the PDR results of the proposed method, as the total delay is directly related to the number of received packets. The exact observation for the CRD delay results, except for the PDR result, is less than that of the proposed method. More specifically, this design trade-off ensures that Aris-RPLs focus is primarily on congestion-aware and load-balancing routing, prioritizing PDR and load balancing over latency by actively avoiding congested nodes and selecting more stable paths in congested scenarios based on its metrics in the learning-based objective function.

5.5. Computational Complexity Analysis

Aris-RPL utilizes a single-row Q-table per node instead of a complete table. Each row corresponds to the node’s neighbors, minimizing memory requirements significantly for resource-constrained devices like the Z1 mote, which has 92 KB of flash memory ROM and 8 KB of RAM, even in dense networks. This approach ensures that nodes handle only the subset of neighbors relevant to routing decisions. However, the number of neighboring nodes directly affects the computational complexity of the proposed routing mechanism during the next-hop parent selection process. Each node updates its Q-values based on received DIO messages, evaluates neighboring nodes (as per Algorithms 1 and 2), and periodically monitors node conditions (as per Algorithm 3). The overall complexity of the proposed framework is O(n), where

n

represents the number of neighbors. More specifically, when a node receives a DIO from a neighboring node, it updates the Q-value for that neighbor based on specific routing metrics. This is a local decision-making action performed at that moment. Moreover, if a node has n neighbors, each evaluation and Q-value update involves n operations, reflecting a linear scaling of computational cost relative to the node’s local environment size, rather than the overall network size. Moreover, Algorithm 3 monitors the node’s internal congestion metrics (buffer utilization and overflow ratio) with O(1) complexity.

Consequently, the overall complexity is O(n) because upon detecting congestion, the node must interact with its n neighbors by either broadcasting updates or re-evaluating the parent set to find a more stable path.

Therefore, this O(n) complexity ensures that the Z1 mote’s performance remains efficient, even in large networks (>100 nodes), without affecting the decision-making process, given that the number of immediate neighbors

n

remains physically limited by radio range and the well-designed distribution of nodes inside the network layout. More specifically, the primary challenge in such scenarios is network control overhead, which our three-phase mechanism is designed to mitigate. However, assessing Aris-RPL performance in such large networks as future work is worth noting.

Considering RAM usage, as shown in Table 6, Aris-RPL introduces a 2.3% increase in ROM size. This is justified by the implementation of the Reinforcement Learning logic, the multi-objective reward function, and the non-linear penalty calculations. Even with this addition, the total firmware size (52 KB) is well below the 92 KB limit of the Z1. Also, Aris-RPL introduces a 0.5% increase in RAM size. The reason for Aris-RPLs low RAM overhead is that Aris-RPL maintains only a single row corresponding to the immediate neighbors, and when a DIO is received, the node processes the metrics immediately to calculate, update, and store the scalar Q-value for that neighbor.

Table 6. ROM and RAM usage of all the Protocols.

5.6. Protocol Implementation of Aris-RPL

The implementation of Aris-RPL builds upon modifications to the ContikiRPL implementation in the Contiki OS. The proposed protocol integrates a Q-Learning-based framework, introducing an adaptive routing mechanism. To achieve this, the implementation involves key steps:

First, Aris-RPL extends the DIO message structure to include the additional metrics essential for routing decisions, including Buffer Utilization, Overflow Ratio, Child Count, Energy level, and Max Q-value. Aris-RPL adds a total of 8 bytes to the DIO payload to support its multi-objective learning, adhering properly to the 127-byte IEEE 802.15.4 MTU limit with no need for packet fragmentation. More specifically, the total length of the DIO message, considering the standard DIO compressed packet [18,60,61], is 49 bytes containing (17 bytes for header compression information, checksum fields, and ICMP type, 24 bytes related to RPL information including instance, rank, DODAG configuration, and route information, 8 bytes for the Aris-RPLs metrics). Second, upon reception of a DIO message, a modification needs to be made to extract and parse these additional metrics from the buffer. Third, the DIO process functions in ContikiRPL processes this information according to Algorithm 1.

Afterward, to continue disseminating node metrics throughout the network, a modification needs to be made to include the additional fields in outgoing DIO messages. This ensures that neighboring nodes receive real-time information about the network state, supporting dynamic decision-making. Following this, in routing decisions, RPL invokes the objective function that was modified according to Algorithms 2 and 3 to evaluate candidate parents based on the Q-values computed using Algorithm 1. Each node maintains a Q-table related to its neighboring nodes, which is updated iteratively based on rewards derived from received network metrics in the incoming DIOs. This enables nodes to refine their routing decisions over time, ensuring that routing decisions adapt to dynamic changes in traffic patterns and network topology.

The same coding logic was employed in both CRD and MAB implementations, each utilizing different metric compositions, reward functions, Q-value update functions, and additional optimization techniques to enhance performance further. For instance, the CRD approach uses Buffer Utilization, link quality, and hop distance metrics to calculate the reward and update the q-values, while MAB combines ETX with energy consumption in one metric called the EEX Metric to calculate the reward and update q-values. compared to the other methods used for comparison. Figure 11 outlines the required and detailed modification steps to the ContikiRPL implementation inside the InstantContiki 3.0 environment. These steps offer a clear pathway to reproduce the mechanism described in the paper.

5.7. Comparative Study

In this section, we provide a structured comparison between Aris-RPL and established baseline routing approaches. The comparison is based on the performance metrics obtained during our large-scale simulations (75–100 nodes) and is summarized in Table 7.

Table 7. A comparative discussion of the key characteristics associated with the methods used for comparison.

Characteristic	RPL-MRHOF	CRD	MAB	Aris-RPL
Objective Function Criteria	The standardized RPL objective function with a focus on minimizing ETX	Multi-metric Q-learning-based optimal parent selection to address congestion and load-balancing in dynamic networks	Multi-armed bandit-based adaptive parent selection in dynamic networks with a focus on energy efficiency	Three-phase Q-learning framework with an effective hybrid monitoring mechanism. Using a composite reward metric.
Complexity per Node	O(n) (number of neighbors)	O(n) (number of neighbors)	O(n) (number of neighbors)	O(n) (number of neighbors)
Performance in Large Networks	Good-Low PDR, High C. Overhead, High E. Consumption, Moderate E2E delay	Good PDR, High C. Overhead, High E. Consumption, Moderate E2E delay	Good PDR, Moderate C. Overhead Moderate E. Consumption, Moderate-High E2E delay	Good PDR, Moderate C. Overhead, Low-Moderate E. Consumption, Moderate-High E2E delay

Figure 11. A General Flow of Aris-RPL Execution and the Modifications in ContikiRPL.

While all evaluated protocols maintain a computational complexity of O(n), they differ significantly in their decision-making logic; RPL-MRHOF is Limited to minimizing ETX, leading to frequent parent switches in congested situations. On the other hand, CRD and MAB introduce adaptive selection but often focus on singular goals such as congestion or energy rather than the holistic objective function. Whilst Aris-RPL implements a three-phase Q-learning framework with a hybrid monitoring mechanism. This allows for a composite reward metric that balances PDR, Energy, and Congestion simultaneously without increasing the complexity. Regarding Performance, as the network density increases to 100 nodes, the limitations of baseline protocols become evident. More specifically, our analysis shows three key advantages for Aris-RPL: (1) Reliability, where Aris-RPL maintains a Good PDR (>80%) consistently and also both CRD and MAB, whereas standard RPL-MRHOF drops into the low category (<50%) under high traffic loads due to its inability to proactively respond to bottlenecks. (2) Efficiency, where Aris-RPL achieves a Moderate Control Overhead (5–10), outperforming MAB and both CRD and standard RPL, which achieve high overhead (>10). (3) Energy Consumption Efficiency, where Aris-RPL achieves moderate energy consumption, outperforming the baseline approaches. On the other hand, regarding the Latency, there was a Trade-off, where Aris-RPL exhibits a Moderate-High E2E Delay (>2.5 s). This is a necessary trade-off for its load-balancing capability. By avoiding congested paths and selecting appropriate alternative multi-hop paths.

5.8. Sensitivity Analysis

Neither hyperparameter α nor γ was arbitrary; these values were selected empirically. More specifically, since reliability is a main focus of the proposed framework, the sensitivity analysis was conducted across all of our experimental configurations, including 50, 75, and 100 nodes with traffic loads of both 5 and 10 ppm regarding the PDR metric. The results presented in Figure 12 represent the average stability across these varying densities and loads. To this end, as shown in Figure 12, we tested α in [0.1, 0.7]. The close results show that for all values between 0.3 and 0.6, the variance was minimum. Also, we tested γ in [0.4, 0.9] and noticed that the results were close to each other between 0.5 and 0.8. The values of 0.4 for the learning rate and 0.7 for the discount factor confirm that the multi-objective reward function provides enough stability and the system converges to an effective policy.

Figure 12. Total Aris-RPL Sensitivity Analysis for PDR Metric.

6. Conclusions and Future Work

Managing load-balancing and congestion in wireless sensor networks (WSNs) significantly impacts performance. This is especially important as WSNs become denser and have heavier traffic patterns. RPL is the de facto routing policy for these environments, and with the rapid development of IoT applications, incorporating learning and automatic adaptation capabilities into RPL routing functionality is highly valuable for handling the complexity of future IoT applications. Accordingly, this study introduces Aris-RPL, a reinforcement learning-based RPL routing policy using a Q-learning algorithm for future IoT networks. It utilizes specific link/node metrics, including RSSI, buffer utilization, overflow ratio, child count, and energy level. These metrics reflect network load, overflow status, and node conditions, enabling intelligent responses to network dynamics. Each node uses these metrics to calculate the Q-values of neighboring nodes, indicating the penalties or benefits of selecting those, allowing nodes to choose more stable candidates as their preferred parents. The extensive simulation experiments have verified Aris-RPL efficiency in terms of Control Overhead, PDR, E2E Delay, and Energy Consumption results compared to other counterparts.

Aris-RPL operates in three phases: exploration, exploitation, and topology monitoring to maintain network performance effectively. Simulation results using the Contiki OS/COOJA simulator, which offers accurate insights into protocol behavior, showed that Aris-RPL outperforms other approaches in energy consumption, packet delivery ratio (PDR), control overhead, and relative end-to-end delay.

However, future work will involve deploying Aris-RPL to assess its performance on real-world testbed challenges, such as WSNs in industrial IoT scenarios. To address such challenges, one direction involves integrating cross-layer optimization mechanisms by allowing the Aris-RPL in the network layer to access information from other layers to influence the routing decisions more. For instance, incorporating MAC-layer metrics such as collision probability and interference levels into the reward calculation at the routing layer to make more informed decisions about forwarding paths under interference.

Additionally, dynamically adjusting the weights of routing metrics in the reward function for each candidate parent based on real-time network conditions and device specifications (e.g., hardware and computational abilities, etc.) will be explored to enhance adaptability to varying network conditions. Although Aris-RPLs focus is primarily on congestion-aware and load-balancing routing, another direction includes incorporating multi-hop metrics, such as cumulative energy consumption and end-to-end delay, to address Aris-RPLs delay limitation and improve the decision-making for more holistic routing strategies. Furthermore, advanced techniques like Deep Q-learning (DQL) and Multi-Agent Reinforcement Learning (MARL) could be applied to enhance protocol reliability. Such future studies aim to validate Aris-RPLs adaptability, ensuring its effectiveness in real-world scenarios.

Author Contributions

Conceptualization, N.H.; methodology, N.H. and Y.D.; software, N.H., A.A., and Y.D.; validation, A.A. and Y.D.; formal analysis, N.H.; investigation, N.H.; resources, N.H.; data curation, N.H. and Y.D.; writing—original draft preparation, N.H.; writing—review and editing, N.H., A.A., and Y.D.; visualization, N.H.; supervision, A.A. and Y.D.; project administration, A.A. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors gratefully acknowledge the support of the K. N. Toosi University of Technology, Faculty of Computer Engineering, for their support and access to research facilities.

Conflicts of Interest

The authors declare that they have no conflicts of interest. The authors agree to the final version of the manuscript.

Abbreviations

IETF	Internet Engineering Task Force
RPL	Routing Protocol for Low Power and Lossy Networks
LLN	Low-Power and Lossy Network
WSN/IoT	Wireless Sensor Network/Internet of Things
RL	Reinforcement Learning
PDR	Packet Delivery Ratio
E2E	End-to-End Delay
RSSI	Received Signal Strength Indicator
DAG	Directed Acyclic Graph
DODAG	Destination-Oriented Directed Acyclic Graph
DIO	DODAG Information Object
DAO	Destination Advertisement Object
DAO-ACK	Destination Advertisement Object-Acknowledgment
DIS	DODAG Information Solicitation
OF	Objective Function
MRHOF	Minimum Rank with Hysteresis Objective Function
HC	Hop Count
QU	Queue Utilization
ETX	Expected Transmission Count
BU	Buffer Utilization
OFR	Overflow Ratio
EWMA	Exponentially Weighted Moving Average
RE	Reminder Energy
CC	Child Count
I_min	Minimum Trickle Algorithm Interval
UB_D	Ubber Bound on Recovery Delay
PPM	Packet Per Minute
ROM/RAM	Read-Only Memory/Random Access Memory

References

Qiu, T.; Chi, J.; Zhou, X.; Ning, Z.; Atiquzzaman, M.; Wu, D.O. Edge Computing in Industrial Internet of Things: Architecture, Advances and Challenges. IEEE Commun. Surv. Tutor. 2020, 22, 2462–2488. [Google Scholar] [CrossRef]
Abdulsattar, N.F.; Abbas, A.H.; Mutar, M.H.; Hassan, M.H.; Jubair, M.A.; Habelalmateen, M.I. An Investigation Study for Technologies, Challenges and Practices of IoT in Smart Cities. In Proceedings of the 2022 5th International Conference on Engineering Technology and Its Applications (IICETA), Al-Najaf, Iraq, 31 May–1 June 2022; IEEE: Al-Najaf, Iraq, 2022; pp. 554–557. [Google Scholar]
Alshehri, F.; Muhammad, G. A Comprehensive Survey of the Internet of Things (IoT) and AI-Based Smart Healthcare. IEEE Access 2021, 9, 3660–3678. [Google Scholar] [CrossRef]
Ibrahim, O.A.; Sciancalepore, S.; Di Pietro, R. MAG-PUFs: Authenticating IoT Devices via Electromagnetic Physical Unclonable Functions and Deep Learning. Comput. Secur. 2024, 143, 103905. [Google Scholar] [CrossRef]
Ni, Q.; Guo, J.; Wu, W.; Wang, H. Influence-Based Community Partition with Sandwich Method for Social Networks. IEEE Trans. Comput. Soc. Syst. 2023, 10, 819–830. [Google Scholar] [CrossRef]
Rehman, S.; Tu, S.; Rehman, O.; Huang, Y.; Magurawalage, C.; Chang, C.-C. Optimization of CNN through Novel Training Strategy for Visual Classification Problems. Entropy 2018, 20, 290. [Google Scholar] [CrossRef] [PubMed]
Darabkh, K.A.; Al-Akhras, M.; Zomot, J.N.; Atiquzzaman, M. RPL Routing Protocol over IoT: A Comprehensive Survey, Recent Advances, Insights, Bibliometric Analysis, Recommendations, and Future Directions. J. Netw. Comput. Appl. 2022, 207, 103476. [Google Scholar] [CrossRef]
Darabkh, K.A.; Al-Akhras, M. RPL over Internet of Things: Challenges, Solutions, and Recommendations. In Proceedings of the 2021 IEEE International Conference on Mobile Networks and Wireless Communications (ICMNWC), Tumkur, Karnataka, 3–4 December 2021; IEEE: Tumkur, Karnataka, India, 2021; pp. 1–7. [Google Scholar]
Musaddiq, A.; Ali, R.; Kim, S.W.; Kim, D.-S. Learning-Based Resource Management for Low-Power and Lossy IoT Networks. IEEE Internet Things J. 2022, 9, 16006–16016. [Google Scholar] [CrossRef]
Musaddiq, A.; Zikria, Y.B.; Zulqarnain; Kim, S.W. Routing Protocol for Low-Power and Lossy Networks for Heterogeneous Traffic Network. J. Wirel. Com. Netw. 2020, 2020, 21. [Google Scholar] [CrossRef]
Lamaazi, H.; Benamar, N. A Comprehensive Survey on Enhancements and Limitations of the RPL Protocol: A Focus on the Objective Function. Ad Hoc Netw. 2020, 96, 102001. [Google Scholar] [CrossRef]
Pourghebleh, B.; Hayyolalam, V. A Comprehensive and Systematic Review of the Load Balancing Mechanisms in the Internet of Things. Clust. Comput. 2020, 23, 641–661. [Google Scholar] [CrossRef]
Pancaroglu, D.; Sen, S. Load Balancing for RPL-Based Internet of Things: A Review. Ad Hoc Netw. 2021, 116, 102491. [Google Scholar] [CrossRef]
Barto, A.; Sutton, R.S. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 10. [Google Scholar]
Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30, pp. 2094–2100. [Google Scholar] [CrossRef]
Lei, J.; Liu, J. Reinforcement Learning-Based Load Balancing for Heavy Traffic Internet of Things. Pervasive Mob. Comput. 2024, 99, 101891. [Google Scholar] [CrossRef]
Tran, T.-N.; Nguyen, T.-V.; Shim, K.; Da Costa, D.B.; An, B. A Deep Reinforcement Learning-Based QoS Routing Protocol Exploiting Cross-Layer Design in Cognitive Radio Mobile Ad Hoc Networks. IEEE Trans. Veh. Technol. 2022, 71, 13165–13181. [Google Scholar] [CrossRef]
Brandt, A.; Hui, J.; Kelsey, R.; Levis, P.; Pister, K.; Struik, R.; Vasseur, J.P.; Alexander, R. RPL: IPv6 Routing Protocol for Low Power and Lossy Networks; Winter, T., Thubert, P., Eds.; IETF (Internet Engineering Task Force): Fremont, CA, USA, 2012. [Google Scholar]
Youssef, M.; Youssef, A.; Younis, M. Overlapping Multihop Clustering for Wireless Sensor Networks. IEEE Trans. Parallel Distrib. Syst. 2009, 20, 1844–1856. [Google Scholar] [CrossRef]
Thubert, P. Objective Function Zero for the Routing Protocol for Low-Power and Lossy Networks (RPL); IETF (Internet Engineering Task Force): Fremont, CA, USA, 2012. [Google Scholar]
Gnawali, O.; Levis, P. The Minimum Rank with Hysteresis Objective Function; IETF (Internet Engineering Task Force): Fremont, CA, USA, 2012. [Google Scholar]
Levis, P.; Clausen, T.; Hui, J.; Gnawali, O.; Ko, J. The Trickle Algorithm; IETF (Internet Engineering Task Force): Fremont, CA, USA, 2011. [Google Scholar]
Russel, S.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson: Harlow, UK, 2016. [Google Scholar]
Lindelauf, R. Nuclear Deterrence in the Algorithmic Age: Game Theory Revisited. In NL ARMS Netherlands Annual Review of Military Studies 2020; Osinga, F., Sweijs, T., Eds.; NL ARMS; T.M.C. Asser Press: The Hague, The Netherlands, 2021; pp. 421–436. ISBN 978-94-6265-418-1. [Google Scholar]
Musaddiq, A.; Olsson, T.; Ahlgren, F. Reinforcement-Learning-Based Routing and Resource Management for Internet of Things Environments: Theoretical Perspective and Challenges. Sensors 2023, 23, 8263. [Google Scholar] [CrossRef] [PubMed]
Musaddiq, A.; Nain, Z.; Ahmad Qadri, Y.; Ali, R.; Kim, S.W. Reinforcement Learning-Enabled Cross-Layer Optimization for Low-Power and Lossy Networks under Heterogeneous Traffic Patterns. Sensors 2020, 20, 4158. [Google Scholar] [CrossRef]
Kim, H.-S.; Kim, H.; Paek, J.; Bahk, S. Load Balancing Under Heavy Traffic in RPL Routing Protocol for Low Power and Lossy Networks. IEEE Trans. Mob. Comput. 2017, 16, 964–979. [Google Scholar] [CrossRef]
Singh, P.; Chen, Y.-C. RPL Enhancement for a Parent Selection Mechanism and an Efficient Objective Function. IEEE Sens. J. 2019, 19, 10054–10066. [Google Scholar] [CrossRef]
Behrouz Vaziri, B.; Toroghi Haghighat, A. Brad-OF: An Enhanced Energy-Aware Method for Parent Selection and Congestion Avoidance in RPL Protocol. Wirel. Pers. Commun. 2020, 114, 783–812. [Google Scholar] [CrossRef]
Seyfollahi, A.; Ghaffari, A. A Lightweight Load Balancing and Route Minimizing Solution for Routing Protocol for Low-Power and Lossy Networks. Comput. Netw. 2020, 179, 107368. [Google Scholar] [CrossRef]
Safaei, B.; Monazzah, A.M.H.; Ejlali, A. ELITE: An Elaborated Cross-Layer RPL Objective Function to Achieve Energy Efficiency in Internet-of-Things Devices. IEEE Internet Things J. 2021, 8, 1169–1182. [Google Scholar] [CrossRef]
Acevedo, P.D.; Jabba, D.; Sanmartin, P.; Valle, S.; Nino-Ruiz, E.D. WRF-RPL: Weighted Random Forward RPL for High Traffic and Energy Demanding Scenarios. IEEE Access 2021, 9, 60163–60174. [Google Scholar] [CrossRef]
Pushpalatha, M.; Anusha, T.; Rama Rao, T.; Venkataraman, R. L-RPL: RPL Powered by Laplacian Energy for Stable Path Selection during Link Failures in an Internet of Things Network. Comput. Netw. 2021, 184, 107697. [Google Scholar] [CrossRef]
Kalantar, S.; Jafari, M.; Hashemipour, M. Energy and Load Balancing Routing Protocol for IoT. Int. J. Commun. 2023, 36, e5371. [Google Scholar] [CrossRef]
Gaddour, O.; Koubaa, A.; Baccour, N.; Abid, M. OF-FL: QoS-Aware Fuzzy Logic Objective Function for the RPL Routing Protocol. In Proceedings of the 2014 12th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Hammamet, Tunisia, 12–16 May 2014; IEEE: Hammamet, Tunisia, 2014; pp. 365–372. [Google Scholar]
Kechiche, I.; Bousnina, I.; Samet, A. A Novel Opportunistic Fuzzy Logic Based Objective Function for the Routing Protocol for Low-Power and Lossy Networks. In Proceedings of the 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; IEEE: Tangier, Morocco, 2019; pp. 698–703. [Google Scholar]
Frikha, M.S.; Gammar, S.M.; Lahmadi, A.; Andrey, L. Reinforcement and Deep Reinforcement Learning for Wireless Internet of Things: A Survey. Comput. Commun. 2021, 178, 98–113. [Google Scholar] [CrossRef]
Park, H.; Kim, H.; Kim, S.-T.; Mah, P. Multi-Agent Reinforcement-Learning-Based Time-Slotted Channel Hopping Medium Access Control Scheduling Scheme. IEEE Access 2020, 8, 139727–139736. [Google Scholar] [CrossRef]
Banerjee, P.S.; Mandal, S.N.; De, D.; Maiti, B. RL-Sleep: Temperature Adaptive Sleep Scheduling Using Reinforcement Learning for Sustainable Connectivity in Wireless Sensor Networks. Sustain. Comput. Inform. Syst. 2020, 26, 100380. [Google Scholar] [CrossRef]
Saleem, A.; Afzal, M.K.; Ateeq, M.; Kim, S.W.; Zikria, Y.B. Intelligent Learning Automata-Based Objective Function in RPL for IoT. Sustain. Cities Soc. 2020, 59, 102234. [Google Scholar] [CrossRef]
Farag, H.; Stefanovic, C. Congestion-Aware Routing in Dynamic IoT Networks: A Reinforcement Learning Approach. In Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021; IEEE: Madrid, Spain, 2021; pp. 1–6. [Google Scholar]
Zahedy, N.; Barekatain, B.; Quintana, A.A. RI-RPL: A New High-Quality RPL-Based Routing Protocol Using Q-Learning Algorithm. J. Supercomput. 2024, 80, 7691–7749. [Google Scholar] [CrossRef]
Han, X.; Xie, M.; Yu, K.; Huang, X.; Du, Z.; Yao, H. Combining Graph Neural Network with Deep Reinforcement Learning for Resource Allocation in Computing Force Networks. Front. Inf. Technol. Electron. Eng. 2024, 25, 701–712. [Google Scholar] [CrossRef]
Manogaran, N.; Raphael, M.T.M.; Raja, R.; Jayakumar, A.K.; Nandagopal, M.; Balusamy, B.; Ghinea, G. Developing a Novel Adaptive Double Deep Q-Learning-Based Routing Strategy for IoT-Based Wireless Sensor Network with Federated Learning. Sensors 2025, 25, 3084. [Google Scholar] [CrossRef] [PubMed]
Zhang, S.; Tong, X.; Chi, K.; Gao, W.; Chen, X.; Shi, Z. Stackelberg Game-Based Multi-Agent Algorithm for Resource Allocation and Task Offloading in MEC-Enabled C-ITS. IEEE Trans. Intell. Transport. Syst. 2025, 26, 17940–17951. [Google Scholar] [CrossRef]
Scarvaglieri, A.; Panebianco, A.; Busacca, F. MAGELLAN: A Distributed MAB-Based Algorithm for Energy-Fair and Reliable Routing in Multi-Hop LoRa Networks. In Proceedings of the GLOBECOM 2024—2024 IEEE Global Communications Conference, Cape Town, South Africa, 8–12 December 2024; IEEE: Cape Town, South Africa, 2024; pp. 2250–2255. [Google Scholar]
Santana, P.; Moura, J. A Bayesian Multi-Armed Bandit Algorithm for Dynamic End-to-End Routing in SDN-Based Networks with Piecewise-Stationary Rewards. Algorithms 2023, 16, 233. [Google Scholar] [CrossRef]
Tanyingyong, V.; Olsson, R.; Hidell, M.; Sjodin, P.; Ahlgren, B. Implementation and Deployment of an Outdoor IoT-Based Air Quality Monitoring Testbed. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 9–13 December 2018; IEEE: Abu Dhabi, United Arab Emirates, 2018; pp. 206–212. [Google Scholar]
Teles Hermeto, R.; Gallais, A.; Theoleyre, F. On the (over)-Reactions and the Stability of a 6TiSCH Network in an Indoor Environment. In Proceedings of the 21st ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, Montreal, QC, Canada, 28 October–2 November 2018; ACM: Montreal, QC, Canada, 2018; pp. 83–90. [Google Scholar]
Bildea, A.; Alphand, O.; Rousseau, F.; Duda, A. Link Quality Metrics in Large Scale Indoor Wireless Sensor Networks. In Proceedings of the 2013 IEEE 24th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), London, UK, 8–11 September 2013; IEEE: London, UK, 2013; pp. 1888–1892. [Google Scholar]
Srinivasan, K.; Levis, P. RSSI Is under Appreciated. In Proceedings of the Third Workshop on Embedded Networked Sensors (EmNets 2006), Cambridge, MA, USA, 30–31 May 2006; Volume 2006, pp. 1–5. [Google Scholar]
Osterlind, F.; Dunkels, A.; Eriksson, J.; Finne, N.; Voigt, T. Cross-Level Sensor Network Simulation with COOJA. In Proceedings of the 2006 31st IEEE Conference on Local Computer Networks, Tampa, FL, USA, 14–16 November 2006; IEEE: Tampa, FL, USA, 2006; pp. 641–648. [Google Scholar]
Eriksson, J.; Österlind, F.; Finne, N.; Tsiftes, N.; Dunkels, A.; Voigt, T.; Sauter, R.; Marrón, P.J. COOJA/MSPSim: Interoperability Testing for Wireless Sensor Networks. In Proceedings of the Second International ICST Conference on Simulation Tools and Techniques, Rome, Italy, 2–6 March 2009; ICST: Rome, Italy, 2009. [Google Scholar]
Contiki: The Open Source Operating System for the Internet of Things. Available online: http://www.contikios.org (accessed on 5 October 2025).
Shirbeigi, M.; Safaei, B.; Mohammadsalehi, A.; Monazzah, A.M.H.; Henkel, J.; Ejlali, A. A Cluster-Based and Drop-Aware Extension of RPL to Provide Reliability in IoT Applications. In Proceedings of the 2021 IEEE International Systems Conference (SysCon), Vancouver, BC, Canada, 15 April–15 May 2021; IEEE: Vancouver, BC, Canada, 2021; pp. 1–7. [Google Scholar]
Al-Hadhrami, Y.; Hussain, F.K. Real Time Dataset Generation Framework for Intrusion Detection Systems in IoT. Future Gener. Comput. Syst. 2020, 108, 414–423. [Google Scholar] [CrossRef]
Sun, G.; Liu, Y.; Chen, Z.; Wang, A.; Zhang, Y.; Tian, D.; Leung, V.C.M. Energy Efficient Collaborative Beamforming for Reducing Sidelobe in Wireless Sensor Networks. IEEE Trans. Mob. Comput. 2021, 20, 965–982. [Google Scholar] [CrossRef]
Dunkels, A.; Osterlind, F.; Tsiftes, N.; He, Z. Software-Based on-Line Energy Estimation for Sensor Nodes. In Proceedings of the 4th Workshop on Embedded Networked Sensors, Cork Ireland, Republic of Ireland, 25–26 June 2007; ACM: Cork, Ireland, 2007; pp. 28–32. [Google Scholar]
Zolertia Z1 Mote. Available online: https://github.com/Zolertia/Resources/wiki/The-Z1-mote (accessed on 15 October 2025).
Thubert, P. Compression Format for IPv6 Datagrams over IEEE 802.15.4-Based Networks; IETF (Internet Engineering Task Force): Fremont, CA, USA, 2011. [Google Scholar]
Vandervelden, T.; Deac, D.; Van Glabbeek, R.; De Smet, R.; Braeken, A.; Steenhaut, K. Evaluation of 6LoWPAN Generic Header Compression in the Context of a RPL Network. Sensors 2023, 24, 73. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Network Graph: (a) Before Using RPL; (b) After Using RPL.

Figure 2. RPL Description Summary.

Figure 5. Proposed General Execution Sequence.

Figure 6. The Complete Workflow of Aris-RPL.

Figure 7. PDR Results in Different Network Scenarios.

Figure 8. Control Overhead Results in Different Network Scenarios.

Figure 9. Total Energy Consumption Results in Different Network Scenarios.

Figure 10. E2E Delay Results in Different Network Scenarios.

Figure 12. Total Aris-RPL Sensitivity Analysis for PDR Metric.

Table 1. Summary of Related Works.

Ref./Pub. Year	Methodology	Research Field	Key Metrics	Simulator	Major Contributions/Results	Limitations
[27] 2017	Rule-Based Objective Function	Routing optimization for RPL networks	QU, HC, ETX	Testbed	According to the simulation scenarios, it improved the packet delivery ratio (PDR) by up to 147% compared to standard RPL.	Not considering metrics like energy usage limits the applicability to dynamic network changes.
[28] 2019	Rule-Based Objective Function	Routing optimization for RPL networks	ETX, QU, Node Lifetime, Latency, Bottleneck Nodes	COOJA	According to the simulation scenarios, it improved energy consumption by 46.5% and communication overhead by 60%, while incurring a 27% increase in delay compared to standard RPL.	Lacked handling of dynamics in larger networks.
[29] 2020	Rule-based objective function	Congestion control in RPL networks	ETX, Residual Energy, Delay, Traffic Intensity	NS-2	According to the simulation scenarios, it reduced energy consumption by up to 40% compared to RPL-MRHOF while maintaining a balanced distribution of traffic load.	Scalability challenges in large networks.
[30] 2020	Rule-based objective function	QoS and load balancing in RPL networks	Parent Conditions, Child Count, Parent Rank	COOJA	According to the simulation scenarios, it reduced energy consumption by 23% and network overhead by 12%, while improving PDR compared to RPL-MRHOF.	Could not handle network dynamics effectively.
[31] 2021	Rule-based objective function	Energy efficiency in RPL Networks	Energy Efficiency, MAC, and RPL Layer Metrics	COOJA	According to the simulation scenarios, it enhanced control overhead, PDR, and energy consumption compared to counterparts by 36.6%, 11.7%, and 39%.	Constrained performance under high traffic loads with high E2E delay.
[32] 2021	Rule-based objective function	Congestion control in RPL Networks	Congestion Rates, HC, Remaining Energy	COOJA	According to the simulation scenarios, it improved PDR and control overhead compared to RPL-MRHOF by an average of 11% and 20%.	Lacked link-quality metrics for decision-making.
[33] 2021	Rule-based objective function	Communication quality in RPL networks	ETX, Link Estimation	COOJA	According to the simulation scenarios, it enhanced PDR, Delay, and Energy Consumption compared to other counterparts by 20%, 24%, and 36%.	High overhead.
[34] 2023	Rule-based objective function	QoS and load balancing in RPL networks	Multi-Metric Parent Selection, Traffic Distribution	COOJA	According to the simulation scenarios, it improved PDR, Delay, and Control Overhead compared to standard RPL by 25%, 20%, and 22%.	High overhead and challenges with dynamic node conditions.
[35] 2014	Fuzzy logic objective function	Effective objective function design for RPL using fuzzy rules	Link Quality, E2E Delay, HC, Energy Consumption	COOJA	According to the simulation scenarios, it reduced the delay by up to 15% and achieved a more balanced energy distribution compared to standard RPL	Limited adaptability due to reliance on predefined rules.
[36] 2019	Fuzzy logic objective function	Reliability optimization in RPL networks	ETX, HC, Parent load, Neighbor Load	COOJA	According to the simulation scenarios, it maintained a stable PDR above 94% and reduced delay while effectively balancing node load using the novel Children Number metric compared to standard RPL.	Limitations in handling network dynamics in large networks.
[40] 2020	Reinforcement Learning objective function	Performance optimization in RPL Networks	ETX, Packet Loss Ratio	COOJA	According to the simulation scenarios, it improved Packet Reception Ratio (PRR) by an average of 7.04%. Energy Consumption by 17.52%. An 18.72% reduction in DIO overhead, and performed marginally better than MRHOF by reducing transmission delays.	Not addressing load balancing explicitly.
[9] 2022	Reinforcement Learning objective function	Performance optimization in RPL networks	EEX (Energy-Enhanced ETX)	COOJA	According to the simulation scenarios, it improved PDR, reduced Energy Consumption and Control Overhead, and maintained lower delay compared to MRHOF.	Not considering complicated evaluation scenarios with different traffic rates.
[26] 2020	Reinforcement Learning objective function	Performance optimization in RPL networks	MAC Contention Metrics	COOJA	According to the simulation scenarios, it improved PRR, reducing Control Overhead by approximately 25–50%, and lowering Total Energy Consumption compared to MRHOF.	Experiencing increased computational complexity in large networks.
[41] 2021	Reinforcement Learning objective function	Congestion control in dynamic RPL networks	Buffer Utilization, HC, ETX	MATLAB	According to the simulation scenarios, it improved network performance by reducing node congestion, resulting in a 16% enhancement in PDR and a 27% reduction in Delay compared to RPL-MRHOF.	Not integrating broader metrics like energy consumption.
[42] 2024	Reinforcement Learning objective function	Quality optimization in RPL Networks	Transmission Success Probability, Residual Energy, Buffer Utilization, HC, ETX	COOJA	According to the simulation scenarios, it enhanced successful delivery, delay, and energy consumption, compared to other counterparts on average by 15.39%, 8.66%, and 25.23%.	High overhead in larger networks and a lack of load-balancing considerations.
[43] 2024	Hybrid GNN and DRL	Dynamic resource allocation	Allocation efficiency, Topology generalization	Python/TensorFlow	According to the simulation scenarios, it enhanced cumulative reward compared to Net_First and Com_First baselines across all topologies on average by 58.0% and 111.2%.	Requires extra techniques to handle heavy computation
[44] 2025	Federated Double Deep Q-Learning	WSN Energy-efficient	Energy, latency, lifetime	MATLAB	According to the simulation scenarios, it enhanced PDR by 5.74%, while significantly reducing delay, energy consumption, and message overhead compared to the baselines.	High computational cost and model synchronization overhead.
[45] 2025	Stackelberg Game-Based Multi-Agent Policy Gradient (SG-MAPG)	Task Offloading in MEC-Enabled C-ITS	Computation Rate, Bandwidth, Channel Gain	Numerical Simulation	Distributed game-theoretic RL approach for joint task offloading and bandwidth resource allocation.	Less focus on the network layer routing
[46] 2024	Distributed routing algorithm based on Multi-Armed Bandit (MAB) learning	Wireless Sensor Networks focusing on multi-hop LoRa networking.	Energy consumption, PDR	LoRa EnergySim	According to the simulation scenarios, it introduced MAGELLAN, which reduced energy consumption by up to 10%, improving PDR by 14% compared to state-of-the-art baselines.	While it provided contributions in their domains, the LoRa is a message-based communication environment that is different compared to the IEEE 802.15.4 standard used in Aris-RPL
[47] 2023	Proposed React-UCB, a Multi-Armed Bandit Bayesian-based algorithm	Wireless Sensor Networks (WSN) and environmental monitoring.	Throughput, delay, and control overhead.	Numerical analysis and case study comparison.	According to the simulation scenarios, it achieved performance enhancements regarding scalability, throughput and delay.	While it provided contributions in its domains, the SDN-based approach relies on a centralized controller to manage routing tables and calculate rewards, while Aris-RPL is distributed
Proposed	Reinforcement Learning objective function	Adaptive routing in dynamic RPL networks	Buffer Utilization, RSSI, Overflow Ratio, Child Count, Energy Usage	COOJA	According to the simulation scenarios, it enhanced Control Overhead, PDR, Delay, and Energy Consumption results compared to other counterparts for all scenarios on average by 39%, 25%, 7%, and 38%.	Lack of latency metric consideration, slightly higher E2E delay in large scenarios

Table 6. ROM and RAM usage of all the Protocols.

RPL Approach	ROM Code Size (Bytes)	% Change in ROM	RAM Consumption (Bytes)	% Change in RAM
Standard RPL (MRHOF)	50,974	---	6030	---
CRD	51,147	+0.34%	6136	+1.76%
MAB	51,176	+0.4%	6056	+0.43%
Aris-RPL	52,142	+2.3%	6060	+0.50%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Aris-RPL: A Multi-Objective Reinforcement Learning Framework for Adaptive and Load-Balanced Routing in IoT Networks

Abstract

1. Introduction

2. Preliminaries

2.1. RPL Protocol Overview

2.2. Reinforcement Learning Overview

3. Related Work

4. Methodology of the Proposed Routing Mechanism

4.1. System Model and Assumptions

4.2. Routing Metrics Used in Aris-RPL

4.2.1. Received Signal Strength Indicator (RSSI)

4.2.2. Buffer Utilization (BU)

4.2.3. Overflow Ratio (OFR)

4.2.4. Reminder Energy (RE)

4.2.5. Child Count

4.3. The Q-Learning Algorithm

4.3.1. The Reward Function

4.3.2. Q-Learning Value Function

4.4. Selection Algorithm

4.5. Network Topology Monitoring Management Function

5. System Setup and Results

5.1. Packet Delivery Ratio (PDR)

5.2. Control Traffic Overhead

5.3. Energy Consumption

5.4. E2E Delay

5.5. Computational Complexity Analysis

5.6. Protocol Implementation of Aris-RPL

5.7. Comparative Study

5.8. Sensitivity Analysis

6. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Article Metrics

Citations

Article Access Statistics