Technologies | Article | Open Access

3 September 2025

Optimizing TSCH Scheduling for IIoT Networks Using Reinforcement Learning

1 Innov’COM Laboratory, Tunis 2083, Tunisia
2 RED Laboratory, Gabes 6002, Tunisia
* Author to whom correspondence should be addressed.
This article belongs to the Section Information and Communication Technologies

Abstract

In the context of industrial applications, ensuring medium access control is a fundamental challenge. Industrial IoT devices are resource-constrained and must guarantee reliable communication while reducing energy consumption. The IEEE 802.15.4e standard proposed time-slotted channel hopping (TSCH) to meet the requirements of the industrial Internet of Things. TSCH relies on time synchronization and channel hopping to improve performance and reduce energy consumption. Despite these characteristics, configuring an efficient schedule under varying traffic conditions and interference scenarios remains a challenging problem. The exploitation of reinforcement learning (RL) techniques offers a promising approach to address this challenge. AI enables TSCH to dynamically adapt its scheduling based on real-time network conditions, making decisions that optimize key performance criteria such as energy efficiency, reliability, and latency. By learning from the environment, reinforcement learning can reconfigure schedules to mitigate interference scenarios and meet traffic demands. In this work, we compare various reinforcement learning (RL) algorithms in the context of the TSCH environment. In particular, we evaluate the deep Q-network (DQN), double deep Q-network (DDQN), and prioritized DQN (PER-DQN). We focus on the convergence speed of these algorithms and their capacity to adapt the schedule. Our results show that the PER-DQN algorithm improves the packet delivery ratio and achieves faster convergence compared to DQN and DDQN, demonstrating its effectiveness for dynamic TSCH scheduling in Industrial IoT environments. These quantifiable improvements highlight the potential of prioritized experience replay to enhance reliability and efficiency under varying network conditions.

1. Introduction

The Industrial Internet of Things (IIoT) is revolutionizing various fields by enabling automation and real-time data exchange [1]. At the core of this transformation lies wireless communication, which serves as the foundation for interactions between sensors, actuators, and control systems. These networks form the backbone of IIoT systems, enabling seamless data transmission and coordination among distributed components.
However, ensuring reliable communication within IIoT networks remains a significant challenge, especially in industrial environments prone to high levels of interference [2]. Interference in such settings can arise from numerous sources, including coexisting wireless networks [3], physical disturbances, and environmental noise. These disruptions can severely affect communication quality, resulting in increased packet loss, higher latency, and increased energy consumption [4]. Consequently, managing interference and maintaining robust communication is crucial to meeting the stringent reliability and performance demands of mission-critical industrial applications.
To address these challenges, the IEEE 802.15.4-2015 time-slotted channel hopping (TSCH) mode has been widely adopted. TSCH improves communication reliability and energy efficiency by taking advantage of two key mechanisms: time synchronization and channel hopping. These mechanisms work in tandem to reduce the impact of interference from various devices, thereby improving overall network robustness [5]. TSCH has been recognized for its capacity to support low-power, resilient, and deterministic communication, which makes it particularly suitable for industrial scenarios.
Despite these advantages, the performance of TSCH can still degrade significantly in the presence of interference, particularly in dynamic or harsh industrial environments. This limitation highlights the need for adaptive and intelligent scheduling strategies that can respond to changing network conditions in real time.
In this study, we introduce an innovative approach that leverages reinforcement learning (RL) to dynamically adapt the TSCH schedule, aiming to mitigate interference and enhance overall network performance [6]. By continuously interacting with the network environment, RL techniques learn optimal scheduling policies and adjust the TSCH schedule in real time [7].
This paper aims to enhance the scheduling performance of TSCH (time-slotted channel hopping) in industrial IoT networks by integrating reinforcement learning techniques to enable adaptive, interference-aware decision-making. The specific objectives of this study are:
  • Model TSCH scheduling as a reinforcement learning (RL) problem: We formalize the TSCH scheduling process as a Markov decision process (MDP), in which the RL agent observes network states (e.g., interference levels, retransmissions, link quality) and learns optimal actions (i.e., slot and channel selection) to improve communication performance.
  • Design an interference-aware reward function: We propose a novel reward function that accounts for packet delivery ratio (PDR), encouraging the agent to adopt scheduling strategies that improve reliability and minimize energy waste.
  • Evaluate and compare advanced RL algorithms: We implement and evaluate three reinforcement learning algorithms—the deep Q-network (DQN), double DQN (DDQN), and prioritized experience replay DQN (PER-DQN)—to assess their effectiveness in terms of convergence speed, adaptability, and network performance.
  • Demonstrate the effectiveness of RL-based TSCH scheduling under interference: Through simulation in interference-prone environments, we assess the RL-based schedulers’ ability to dynamically adapt schedules in real time and improve network metrics such as reliability.
To the best of our knowledge, this is among the first works to systematically evaluate and compare these RL algorithms for adaptive TSCH scheduling. Our study contributes a flexible and intelligent scheduling solution capable of operating in real-time, interference-heavy industrial environments.

2. Technical Background

Before presenting the proposed approach, it is important to provide the necessary technical background. This section introduces the fundamental concepts that will be used throughout the paper, namely time-slotted channel hopping (TSCH) and reinforcement learning (RL). These foundations are essential to understanding both the problem formulation and the motivation for applying RL techniques in wireless communication networks.

2.1. Overview of Time-Slotted Channel Hopping

The IEEE 802.15.4e amendment introduced TSCH to provide medium access control for industrial low-power wireless networks. TSCH improves reliability and reduces energy consumption by combining channel hopping and synchronization. Communications are organized as a slotframe structure. Each slotframe is a collection of cells that repeat over time. Each cell is defined by a channel offset and a timeslot as described in Figure 1.
Figure 1. Schedule in an IEEE 802.15.4—TSCH network.
At the beginning of each slot, devices can decide to receive, send, or sleep, turning their radio off to save energy. Let us consider three nodes (B, C, and D) transmitting data to the border router (A) (cf. Figure 1). At the beginning of the slotframe, some shared cells (highlighted in orange) are allocated for control packets on channel offset 0. All links involving a common node (A) must be scheduled in different timeslots. In this example, two dedicated cells are assigned for the transmissions from B to A.
TSCH calculates the number of slots since network creation using the Absolute Sequence Number (ASN) parameter. ASN and channel offset are used to calculate the physical frequency, freq, used during the communication as follows:
Freq = Map[(SlotNumber + Offset) % SeqLength]
where
  • Freq represents the physical frequency that will be used during the exchange.
  • Map[] is the mapping function that translates an integer into a physical frequency.
  • SlotNumber represents the current slot number.
  • Offset is the channel offset, which shifts the sequence to avoid interference.
  • SeqLength is the length of the mapping sequence, determining how many possible frequencies can be chosen from the sequence.
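To make the mapping concrete, the short Python sketch below computes the physical channel from the slot number (ASN) and the channel offset. The hopping sequence and channel numbering (IEEE 802.15.4 channels 11–26 in the 2.4 GHz band) are illustrative assumptions, not a normative configuration.

```python
# Illustrative sketch of Freq = Map[(SlotNumber + Offset) % SeqLength].
# The hopping sequence below (channels 11-26) is an assumed example Map[].
HOPPING_SEQUENCE = list(range(11, 27))  # 16 entries

def tsch_frequency(slot_number: int, channel_offset: int) -> int:
    """Translate (ASN, channel offset) into a physical channel."""
    seq_length = len(HOPPING_SEQUENCE)
    return HOPPING_SEQUENCE[(slot_number + channel_offset) % seq_length]

# The same cell (channel offset 2) hops to a different physical channel at
# every slotframe repetition because the ASN keeps increasing.
for asn in (0, 16, 32):
    print(asn, tsch_frequency(asn, channel_offset=2))
```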
TSCH defines dedicated and shared cells to optimize performance. Dedicated cells are used by radio links that do not interfere. In shared cells, the contention mechanism is executed to ensure retransmission in case of collision.
The efficiency of TSCH networks depends heavily on the scheduling algorithm. Scheduling must manage complex and often unpredictable network conditions, such as interference and varying traffic loads. These factors affect communication reliability and energy efficiency. Traditional scheduling methods are typically static or heuristic, limiting their adaptability to real-time changes in the network environment.
Despite its advantages, effective scheduling in TSCH remains a challenge due to the dynamic nature of industrial environments, variable traffic patterns, and interference. These challenges motivate the exploration of adaptive scheduling mechanisms, such as those based on reinforcement learning (RL), to dynamically learn and adjust schedules in real time.

2.2. Overview on Reinforcement Learning

Machine learning is divided into three main classes, as described in Figure 2: supervised learning, unsupervised learning, and reinforcement learning [8].
Figure 2. Categories of machine learning.
Supervised learning relies on labeled data, where algorithms map inputs to known outputs [9]. It is used for classification and regression. Unsupervised learning is suitable for clustering and anomaly detection [10]; it relies on unlabeled datasets.
Reinforcement learning (RL) is based on an agent that interacts with an environment [11]. It learns optimal actions through rewards and penalties. The objective is to develop a strategy that maximizes cumulative rewards over time. This mechanism is used in environments where conditions change over time, requiring adaptive and intelligent decision making. RL is applied in various domains such as robotics, automated control systems, and gaming. Figure 3 illustrates the action and reward strategy of a generic RL mechanism. The agent has to explore new states and maximize its reward simultaneously. It should make the best overall decision in the future by collecting enough information from its environment.
Figure 3. Reinforcement learning.
Basic elements of RL are as follows:
  • Environment: Physical space in which the agent operates;
  • State: The actual situation of the agent;
  • Reward: The received feedback from the agent’s environment;
  • Policy: The strategy used to map states to actions;
  • Value: The expected future reward that an agent would receive by taking an action in a particular state.
Reinforcement learning is a decision-making process in which an agent interacts with an environment to learn optimal decisions. At each time step, the agent observes its state S_t and selects an action A_t based on a policy that defines its strategy for choosing actions. The environment responds by transitioning to a new state S_{t+1} and providing a reward R_t, which serves as feedback. The agent’s goal is to maximize cumulative rewards, discounted by a factor γ. Over time, the agent refines its policy through exploration and exploitation, progressively improving its decisions to achieve optimal performance in the considered environment.
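As a minimal illustration of this interaction loop, the sketch below runs a random policy in a generic environment and accumulates the discounted return. The Gymnasium API and the CartPole environment are assumed stand-ins for the interaction pattern only; they are not the TSCH environment used later in the paper.

```python
# Minimal agent-environment interaction loop (Gymnasium API assumed).
# CartPole-v1 is a stand-in environment; the policy here is random.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)
gamma, discount, G = 0.95, 1.0, 0.0   # discount factor and return G_t

for t in range(200):
    action = env.action_space.sample()              # placeholder policy
    state, reward, terminated, truncated, _ = env.step(action)
    G += discount * reward                          # accumulate discounted reward
    discount *= gamma
    if terminated or truncated:
        break

print("Discounted return:", G)
```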

4. Problem Statement

TSCH has emerged as the de facto MAC standard for industrial IoT (IIoT) applications due to its ability to provide reliable and energy-efficient communication. By combining time-slot scheduling with frequency hopping, TSCH addresses the stringent reliability and latency requirements of industrial environments. However, its performance significantly degrades under unpredictable traffic patterns and rapidly changing network conditions. In the presence of interference, the packet delivery ratio (PDR) is reduced and latency increases, which undermines the reliability guarantees expected in mission-critical scenarios. Existing predefined scheduling mechanisms are limited by their static nature; they lack adaptability to dynamic traffic loads and fail to prevent collisions efficiently in environments with interference. This limitation creates a clear gap between TSCH’s theoretical potential and its real-world performance in harsh IIoT conditions.
To bridge this gap, adaptive and intelligent scheduling is required. Reinforcement learning (RL) provides a promising solution by enabling agents to interact with the network environment and progressively optimize scheduling decisions. In particular, deep Q-networks (DQNs) and their extensions, such as double DQNs and prioritized DQNs, offer the capacity to learn effective scheduling policies in high-dimensional state spaces. By modeling slot and channel selection as a Markov decision process (MDP), an RL agent can dynamically adapt to interference, ensuring improved network performance compared to static approaches.
The objectives of this research are as follows:
  • Define TSCH scheduling as an RL problem where the agent makes decisions based on observed network states.
  • Develop and compare the performance of DQN, double DQN, and prioritized DQN for adaptive scheduling.
  • Evaluate the proposed solutions using key metrics such as packet delivery ratio (PDR) and reward convergence.
The expected outcome is an adaptive and interference-aware scheduling framework that enhances TSCH performance in IIoT applications. By leveraging advanced RL techniques, the solution aims to ensure robust, efficient, and resilient communication, thereby overcoming the shortcomings of predefined scheduling mechanisms and meeting the stringent requirements of industrial networks.

5. Formulation of DQN-Based TSCH Scheduling

In this part, we formulate the use of deep Q-networks (DQNs) in the context of time-slotted channel hopping (TSCH) scheduling. While designing our proposed RL scheduling mechanism, we have to consider different constraints such as energy consumption optimization, interference avoidance, and transmission reliability:
  • Collision Avoidance: The objective of the proposed solution is to assign the channel that minimizes interference by avoiding channels that may lead to collisions.
  • Energy Efficiency: Efficient scheduling strategies reduce energy consumption, which is a critical factor in low-power wireless networks. Therefore, we have to implement an efficient communication strategy that balances network performance and power usage.
  • Packet Delivery Ratio (PDR): This metric measures the reliability of transmission. Low PDR indicates that interference has occurred. An adaptive scheduling strategy may reduce packet loss and improve the PDR.
We model the TSCH network as a Markov decision process (MDP) based on the state space, action space, and reward function. The state space represents network conditions such as interference levels and link quality. The action space consists of selecting slots and channels for packet transmission. We design the reward function to avoid collisions and maximize the packet delivery ratio. The goal of our solution is to model the TSCH environment as a reinforcement learning (RL) task, where the agent (RL model) learns to optimize the scheduling process.

5.1. State Modeling

We model the state S_t at time t as a vector that captures the key transmission characteristics of all hops traversed by the current packet. Specifically, the state is defined as:
S_t = [A_1, R_1, F_1, RSSI_1, A_2, R_2, F_2, RSSI_2, …, A_n, R_n, F_n, RSSI_n]
where n is the total number of hops. Each hop is represented by four parameters:
  • A_i: Address of the hop;
  • R_i: Number of retransmissions;
  • F_i: Assigned transmission frequency;
  • RSSI_i: Received signal strength indicator, reflecting the link quality.
These parameters are structured in a vector that captures both link quality indicators and topology information. The DQN uses this representation to learn decisions that improve network performance.
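A possible encoding of this state vector is sketched below. The dataclass fields mirror the four per-hop parameters; the fixed size of six hops and the zero-padding are assumptions based on the simulation setup described in Section 8.

```python
# Sketch of S_t = [A_1, R_1, F_1, RSSI_1, ..., A_n, R_n, F_n, RSSI_n].
# Padding to six hops (as in the simulation environment) is an assumption.
from dataclasses import dataclass
import numpy as np

@dataclass
class Hop:
    address: int          # A_i: node address
    retransmissions: int  # R_i: retransmission count
    frequency: int        # F_i: assigned channel
    rssi: float           # RSSI_i in dBm

def build_state(hops: list, max_hops: int = 6) -> np.ndarray:
    """Flatten the hop parameters into the fixed-size vector fed to the DQN."""
    state = np.zeros(4 * max_hops, dtype=np.float32)
    for i, h in enumerate(hops[:max_hops]):
        state[4 * i:4 * i + 4] = (h.address, h.retransmissions, h.frequency, h.rssi)
    return state

print(build_state([Hop(1, 0, 15, -72.0), Hop(2, 1, 20, -85.0)]).shape)  # (24,)
```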

5.2. Action Representation

The action space consists of the possible decisions the agent can take to maximize rewards. Actions are selected using an ϵ-greedy policy, under which the agent can:
  • Exploit: Choose the action with the highest Q-value (greedy choice).
  • Explore: Select a random action with probability ϵ to improve learning.
At time t, the action space comprises all channel assignments that the agent can select for the current hop. We denote by N_channels the total number of channels a node can select; in TSCH, the agent chooses among 16 channels at each step. The action space is therefore:
A_t ∈ {0, 1, 2, …, N_channels − 1}
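The ϵ-greedy channel selection can be sketched as follows; the Q-values are assumed to come from the trained network, and the function only illustrates the exploration/exploitation switch described above.

```python
# Epsilon-greedy selection over the 16 TSCH channel offsets.
import random
import numpy as np

N_CHANNELS = 16

def select_action(q_values: np.ndarray, epsilon: float) -> int:
    """Return a channel index in {0, ..., N_channels - 1}."""
    if random.random() < epsilon:
        return random.randrange(N_CHANNELS)   # explore: random channel
    return int(np.argmax(q_values))           # exploit: highest Q-value

print(select_action(np.random.rand(N_CHANNELS), epsilon=0.5))
```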

5.3. Reward Function

To improve scheduling decisions, the RL agent has to maximize rewards over time. Positive rewards represent the effectiveness of actions.
In TSCH scheduling, the reward function R_t evaluates the selected channel based on the packet delivery ratio (PDR) achieved at each step, encouraging the agent to adopt a strategy that improves communication reliability.
We define the reward function as:
R_t = α · PDR
where
  • PDR represents the ratio of successfully delivered packets.
  • α is a factor that adjusts the impact of PDR on the reward function.
By relying on PDR, the reward function promotes successful transmissions. The agent adopts scheduling strategies that maximize network performance and data delivery.
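A minimal sketch of this reward is given below. The per-hop success criterion (RSSI above −80 dBm and at most two retransmissions) follows the simulation setup in Section 8, and the scaling factor α is an assumed value since the text does not fix one.

```python
# Reward R_t = alpha * PDR, with PDR computed over the hops of the current packet.
def packet_pdr(hops) -> float:
    """Fraction of hops delivered successfully (RSSI > -80 dBm, <= 2 retransmissions)."""
    if not hops:
        return 0.0
    ok = sum(1 for h in hops if h.rssi > -80.0 and h.retransmissions <= 2)
    return ok / len(hops)

def reward(hops, alpha: float = 100.0) -> float:
    """alpha is an assumed scaling factor; the paper leaves its value unspecified."""
    return alpha * packet_pdr(hops)
```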

5.4. State Transition Function

The state transition to S_{t+1} models the new state after the RL agent selects an action A_t. The transition depends on the current packet’s hop parameters and on the channel selected by the agent:
S_{t+1} = f(S_t, A_t)
where f is a function that models the update of the state based on the action. The state modification reflects the following:
  • Whether the transmission on the selected channel was successful.
  • The resulting link quality, degraded by interference or improved by a successful communication.

5.5. DQN Update Rule

The goal of DQN is to learn the optimal action strategy that maximizes the expected future rewards. The Q-value for a state–action pair (S_t, A_t) is updated using the Bellman equation:
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]
where
  • α represents the learning rate, which determines how much new information overrides the old.
  • γ represents the importance of future rewards.
  • max_a Q(S_{t+1}, a) is the maximum predicted Q-value for the next state S_{t+1}.
To determine Q-values, we use a neural network instead of a Q-table. We update the neural network in order to obtain an optimal strategy.
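The sketch below shows how this update is typically realized with a neural network in PyTorch (an assumed framework): the online network is regressed toward the Bellman target R_t + γ max_a Q(S_{t+1}, a) computed with a target network. The layer sizes and the omission of terminal-state masking are simplifications, not the authors’ exact architecture.

```python
# One DQN gradient step on a mini-batch of transitions (PyTorch assumed).
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 24, 16, 0.95
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(states, actions, rewards, next_states):
    """Regress Q(S_t, A_t) toward R_t + gamma * max_a Q_target(S_{t+1}, a)."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + GAMMA * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)  # terminal-state masking omitted
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```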

5.6. Proposed Algorithm

The proposed algorithm leverages deep Q-networks (DQNs) to optimize TSCH (time-slotted channel hopping) scheduling by dynamically selecting communication channels to minimize interference. The detailed steps are presented in Algorithm 1.
Algorithm 1: Summary of TSCH Scheduling Algorithm with DQN
The process begins with initializing the simulation environment and the DQN agent, which is trained over multiple episodes. In each episode, the agent observes the network state, selects an action (a communication channel) using an ϵ-greedy policy, and receives a reward based on the effectiveness of the selection. Each experience is stored and used to update the DQN model through experience replay, ensuring improved decision-making over time.
As training progresses, the exploration rate ϵ is gradually reduced to favor exploitation over exploration.
The final model aims to optimize scheduling by reducing interference and thus improving network performance.
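The overall loop summarized in Algorithm 1 can be sketched as follows. The `env` and `agent` objects are assumed interfaces (reset/step and select_action/learn/sync_target) that mirror the description above; they are not the authors’ implementation.

```python
# High-level sketch of Algorithm 1 (DQN training for TSCH channel selection).
import random
from collections import deque

def train(env, agent, episodes=500, eps=1.0, eps_min=0.01, eps_decay=0.995, batch=32):
    replay = deque(maxlen=10_000)                       # experience replay buffer
    for episode in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = agent.select_action(state, eps)    # epsilon-greedy channel choice
            next_state, reward, done = env.step(action) # schedule one transmission
            replay.append((state, action, reward, next_state, done))
            if len(replay) >= batch:
                agent.learn(random.sample(list(replay), batch))  # replay a mini-batch
            state = next_state
        eps = max(eps * eps_decay, eps_min)             # shift exploration -> exploitation
        if episode % 10 == 0:
            agent.sync_target()                         # periodic target-network update
```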

6. Double DQN for TSCH Scheduling

The double deep Q-network (double DQN, or DDQN) is an enhancement of the deep Q-network (DQN) proposed to reduce overestimation in Q-learning. In standard DQN, using the same neural network both to select and to evaluate actions leads to overestimated Q-values.
DDQN addresses this problem by using two separate networks: an online network, which selects the action with the highest Q-value, and a target network, which evaluates the selected action in order to reduce the overestimation.
This separation improves learning efficiency and yields good performance in complex environments. Given a state–action pair (S_t, A_t), the DDQN Q-value update is formulated in the following subsection.

6.1. Mathematical Formulation of DDQN

Double DQN aims to learn the optimal action strategy that enhances future rewards. The Q-value for a state–action pair ( S t , A t ) is updated using the Bellman equation as follows:
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻) − Q(S_t, A_t) ]
where
  • α is the learning rate that controls how much new information overrides old estimates.
  • γ evaluates how much the agent values future rewards. If this parameter is close to zero, the agent focuses on immediate rewards; otherwise, it focuses on long-term rewards.
  • argmax_a Q(S_{t+1}, a; θ) is the action selected by the online network for the next state S_{t+1}.
  • Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻) is the Q-value of the selected action, evaluated by the target network. This separation serves to reduce overestimation.
The target Q-value in double DQN is updated as follows:
y_t = R_t + γ Q_{θ⁻}(S_{t+1}, argmax_a Q_θ(S_{t+1}, a))
where
  • R_t is the reward at time t.
  • γ is the discount factor.
  • Q_{θ⁻} is the target network, and Q_θ is the online network.
  • S_{t+1} is the state at time t + 1.
  • argmax_a Q_θ(S_{t+1}, a) selects the action that maximizes the online Q-value in the next state S_{t+1}.
The objective is to maximize the cumulative discounted rewards.
G_t = Σ_{k=0}^{∞} γ^k R_{t+k}
where γ is the discount factor that determines the importance of future rewards.
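The decoupled selection/evaluation in the double-DQN target can be written compactly as in the sketch below (PyTorch assumed, as in the earlier DQN sketch; network objects are the same hypothetical `q_net` and `target_net`).

```python
# Double-DQN target: the online network selects the action, the target network evaluates it.
import torch

def ddqn_target(rewards, next_states, q_net, target_net, gamma=0.95):
    """y_t = R_t + gamma * Q_target(S_{t+1}, argmax_a Q_online(S_{t+1}, a))."""
    with torch.no_grad():
        best = q_net(next_states).argmax(dim=1, keepdim=True)        # selection (online net)
        value = target_net(next_states).gather(1, best).squeeze(1)   # evaluation (target net)
        return rewards + gamma * value
```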

6.2. Proposed DDQN Algorithm

As shown in Algorithm 2, we apply a Double DQN approach to optimize TSCH scheduling.
Algorithm 2: Double DQN for TSCH scheduling

7. Formulation of Prioritized Experience Replay Deep Q-Network (PER-DQN) for TSCH Scheduling

7.1. Mathematical Formulation

As described previously, the environment is represented by the TSCH network state S_t, which includes the available channels, interference indicators, and slots. The action A_t represents the scheduling choice made by the agent at time t.
We define the following components:
  • Replay Buffer 𝒟: This buffer stores experiences (S_t, A_t, R_t, S_{t+1}), where S_t is the state, A_t is the selected action, R_t is the received reward, and S_{t+1} is the next state reached after executing action A_t.
  • Action–Value Function Q(S, A; θ): A neural network parameterized by θ that estimates the expected return (reward) for taking action A in state S.
  • Target Network Q(S, A; θ⁻): A copy of the action–value network that is updated periodically with the parameters of the main network.
For each episode, the first step consists of resetting the environment and initializing the state. Actions are then selected using the ϵ-greedy policy, balancing exploration and exploitation. The selected action is executed, yielding a reward and a transition to the next state. The temporal difference (TD) error is calculated and used to compute the priority with which the experience is stored in the buffer. Experiences are stored with prioritized sampling and, once the buffer holds enough samples, a mini-batch is replayed. For each experience, importance-sampling weights and target values are calculated, followed by the computation of the loss function as a weighted mean squared error. The Q-network is updated via gradient descent, and the priorities in the buffer are adjusted. Periodically, the target network is updated, and exploration gradually shifts toward exploitation. Finally, the agent’s performance is analyzed based on metrics such as reward and packet delivery ratio (PDR).
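A minimal proportional prioritized replay buffer implementing these steps is sketched below. The exponents α and β, the priority offset, and the capacity are assumed hyperparameters, and the buffer uses a flat array rather than the sum-tree often used in practice.

```python
# Proportional prioritized experience replay: p_i = (|TD error| + eps)^alpha,
# sampling probability P(i) = p_i / sum_j p_j, IS weight w_i = (N * P(i))^(-beta).
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=10_000, alpha=0.6, eps_p=1e-3):
        self.capacity, self.alpha, self.eps_p = capacity, alpha, eps_p
        self.data, self.pos = [], 0
        self.priorities = np.zeros(capacity, dtype=np.float64)

    def add(self, transition, td_error):
        p = (abs(td_error) + self.eps_p) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.data)]
        probs = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()                     # normalize for stability
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = (np.abs(td_errors) + self.eps_p) ** self.alpha
```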

7.2. Proposed Algorithm

We further enhance the DQN approach using Prioritized Experience Replay, as detailed in Algorithm 3.
Algorithm 3: Prioritized Experience Replay Deep Q-Network (PER-DQN)
The prioritized experience replay deep Q-network (PER-DQN) algorithm is an enhancement of the standard deep Q-network (DQN) that improves learning efficiency by prioritizing important transitions in the replay buffer. In this algorithm, a TSCH network’s state, including interference, slot matrix, and channels, is used to optimize scheduling and minimize interference. Initially, the algorithm sets up the replay buffer, action-value function, and a target network. During training, an ϵ -greedy policy is used to select actions, and transitions are stored in the buffer with a priority determined by the temporal difference (TD) error. When the buffer contains enough samples, a mini-batch is sampled based on these priorities, and a gradient descent step is performed to minimize the loss and update the priorities. The target network is periodically updated, and ϵ and β are adjusted over time. Ultimately, the algorithm optimizes the scheduling process by continuously updating its policy to minimize interference in the network.

8. Simulation Environment

The simulation environment replicates TSCH scheduling in a multi-hop industrial IoT network. Each packet is modeled with global identifiers and hop-specific data for up to six hops, including node address, retransmission count, channel frequency, and RSSI.
The action space is defined as the selection of one frequency channel from 16 available channels. At each step, the reinforcement learning agent chooses a channel, and the environment evaluates the resulting transmission outcome based on interference, retransmissions, and signal quality.
The reward function is based on the packet delivery ratio (PDR) calculated for each packet during the scheduling process. Specifically, the PDR is defined as the ratio of successful transmissions within all the hops of the current packet. A successful transmission corresponds to a hop where the RSSI is above -80 dBm and the number of retransmissions is less than or equal to two. This approach provides timely and localized feedback to the agent after each scheduling decision, enabling effective learning. The cumulative reward over an episode reflects the overall performance of the scheduling policy across multiple packets.
The training process spans 500 episodes, with parameters set as follows: learning rate = 0.001, discount factor γ = 0.95, and exploration rate starting from ϵ = 1.0 with exponential decay ϵ ← max(ϵ × 0.995, 0.01).
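For reference, the reported hyperparameters and the exponential ϵ decay can be collected as in the sketch below; the dictionary layout is illustrative, while the values are those stated above.

```python
# Training configuration reported above, with the exponential epsilon decay.
config = {
    "episodes": 500,
    "learning_rate": 1e-3,
    "gamma": 0.95,           # discount factor
    "epsilon_start": 1.0,
    "epsilon_min": 0.01,
    "epsilon_decay": 0.995,  # epsilon <- max(epsilon * 0.995, 0.01) per episode
}

def epsilon_at(episode: int) -> float:
    """Epsilon after `episode` decay steps."""
    return max(config["epsilon_start"] * config["epsilon_decay"] ** episode,
               config["epsilon_min"])

print(epsilon_at(0), epsilon_at(250), epsilon_at(499))  # 1.0, ~0.29, ~0.08
```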

9. Realization and Simulation Evaluation

In this section, we describe the work conducted. We implemented the deep Q-network (DQN), double DQN, and prioritized experience replay DQN (prioritized DQN) in the context of the TSCH (time-slotted channel hopping) scheduling mechanism. The objective was to enhance scheduling decisions in a dynamic and power-constrained network environment.
Thus, we propose a reinforcement learning (RL) agent that interacts with the TSCH environment. We model the action–value function using deep learning techniques and train the agent with a neural network.
For prioritized DQN, we implemented a priority-based experience replay mechanism to enhance the learning process by prioritizing important transitions. This approach aims to improve convergence speed and learning efficiency compared to the standard DQN method.
In this setup, the state space represents the hop information of each packet, including address, retransmissions, frequency, and RSSI. The action space consists of selecting one of the 16 frequency channels for scheduling. The reward function incentivizes interference-free scheduling by penalizing actions that cause frequency collisions and rewarding those that lead to successful transmissions. Training was conducted over 500 episodes with an initial exploration rate (ϵ) of 1.0, gradually decaying to 0.01, using a learning rate of 0.001 and a discount factor (γ) of 0.95. The simulations aimed to allow the agent to learn optimal scheduling policies that minimize interference and maximize packet delivery efficiency under varying network conditions.

10. Performance of DQN in TSCH

10.1. Results of DQN Performance in TSCH

The following figures depict the performance of the DQN algorithm in a time-slotted channel hopping (TSCH) environment. Each figure highlights key performance metrics over 500 training episodes. First, we evaluate the convergence of DQN and its ability to adapt the schedule and improve transmission efficiency.

10.2. Reward Progression

Figure 4 presents the evolution of rewards over training. Reward values fluctuate due to the dynamic interaction with the TSCH environment. Rewards start at around 6000 and increase to 7500 in the later stages, indicating that the DQN agent is able to improve its decision strategy.
Figure 4. Evolution of rewards.

10.3. Packet Delivery Ratio Evaluation

Figure 5 depicts the packet delivery ratio (PDR) over episodes. We observe a clear progression in terms of PDR. It starts around 0.72 and increases to 0.8 in the final stages. This improvement indicates that the agent is able to optimize its scheduling decisions. A large proportion of packets are successfully transmitted. The TSCH network is able to ensure data transmission with lower packet loss.
Figure 5. Evolution of PDR.

10.4. Epsilon Decay over Episodes

We illustrate in Figure 6 the progression of epsilon over training episodes. Epsilon drives the ϵ-greedy strategy that balances exploration and exploitation. Epsilon is equal to 1 at the beginning of training; higher values indicate that the agent selects random actions to explore the environment. As epsilon decreases, the agent moves to an exploitation phase and converges toward an optimal strategy that prioritizes actions offering higher rewards.
Figure 6. Epsilon decay curve.

11. Analysis of DDQN Effectiveness in TSCH

11.1. DDQN Performance Metrics

In this section, we present the results of implementing the double deep Q-network (DDQN) within the time-slotted channel hopping (TSCH) network. Our objective is to study the effectiveness of the DDQN in optimizing TSCH scheduling.

11.2. Reward Progression

The reward values presented in Figure 7 show a clear evolution from initial to final episodes. The DDQN agent is able to maximize rewards compared to the DQN agent. While moving from exploration to exploitation, the agent is able to learn and select an optimal schedule. The ability of DDQN to achieve higher rewards improves the network performance and the efficiency of packet transmission.
Figure 7. Evolution of rewards.

11.3. Evolution of Packet Delivery Ratio

As training progresses, the PDR described in Figure 8 exhibits a gradual upward trend, reaching up to 97% in the final episodes. While the curve presents fluctuations due to the exploration process and environmental dynamics, a general improvement in PDR is observed over time. The DDQN agent learns to prioritize slots with lower interference and reduced packet loss. Despite some variance, the PDR values commonly center around 90%, reflecting the agent’s ability to converge toward an effective and robust scheduling policy that enhances network performance.
Figure 8. Evolution of PDR.
As outlined earlier, the epsilon value decreases progressively throughout the training (Figure 9). This indicates a transition from an exploration phase to an exploitation phase. After testing different actions in the first phase, the agent relies on the learned strategies and applies them to optimize performance.
Figure 9. Epsilon decay curve.

12. Analysis of PER-DQN Effectiveness in TSCH

12.1. PER-DQN Performance Metrics

In this part, we evaluate the performance of applying the deep Q-network (DQN) enhanced with prioritized experience replay (PER) to the time-slotted channel hopping (TSCH) protocol. We focus on the same metrics: rewards, epsilon values, and packet delivery ratio (PDR).

12.2. Evolution of Rewards

As shown in Figure 10, the reward progression in PER-DQN shows a steady improvement throughout training. Initially, rewards fluctuate around 7000, reflecting the model’s early exploration phase. As training advances, rewards increase to approximately 7500 by mid-training and eventually peak at 8500 in the final episodes. This consistent growth suggests that PER-DQN effectively optimizes scheduling decisions, leading to better network performance. Unlike traditional DQN and double DQN, the use of prioritized experience replay allows the model to focus on crucial experiences, accelerating learning and reducing suboptimal decisions.
Figure 10. Evolution of rewards.

12.3. Packet Delivery Ratio (PDR) Trends

Figure 11 illustrates that the PDR begins at approximately 0.93, reflecting a relatively high initial transmission success rate. As training progresses, the PDR stabilizes with occasional variations. By the final episodes, the PDR reaches 0.98, demonstrating a significant improvement in packet transmission reliability. Although the PDR curve shows some fluctuations during training, this is expected in reinforcement learning due to the exploration mechanism and the stochastic nature of the environment. The agent continues to explore different actions to improve its performance, which can occasionally result in lower PDR values. However, the overall trend remains high—mostly above 96%, indicating that the agent has learned an effective policy. These fluctuations do not reflect instability but are a normal part of the learning process in dynamic environments. Compared to standard DQN, PER-DQN converges to a higher PDR, likely due to its ability to prioritize critical experiences and adapt more effectively to dynamic conditions.
Figure 11. Evolution of PDR.
The shift from exploration to exploitation is described in Figure 12. Epsilon values follow a typical decay pattern: they start at 0.995 and decrease toward zero in the final episodes.
Figure 12. Epsilon decay curve.

13. Discussion and Interpretation of Results

Our simulations outline the advantages of using reinforcement learning in TSCH environments. RL is a promising technique for scheduling in industrial Internet of Things applications.
The results indicate that all RL algorithms ensure good learning efficiency and convergence. However, the prioritized DQN, compared to the DQN and the double DQN, converges faster and presents the best performance. Thus, PER-DQN is well suited for industrial applications that require rapid decision-making and scheduling to manage processes efficiently, often in real time.

14. Conclusions

In this paper, we evaluated the performance of reinforcement learning mechanisms for optimizing scheduling in TSCH-based networks. We applied deep Q-learning (DQN), double DQN, and prioritized DQN (PER-DQN) in a TSCH environment. Through simulations, we showed that all of these reinforcement learning mechanisms minimize interference and improve the packet delivery ratio under varying network conditions. By leveraging DQN, double DQN, and PER-DQN, the RL agent learns optimal frequency channel allocation policies. Adaptive scheduling decisions help to enhance reliability and network performance.
The obtained results indicate that applying prioritized DQN guarantees faster convergence. Moreover, PER-DQN results in the best performance compared to other RL techniques due to replaying experiences with higher TD-error. Thus, the RL agent is able to optimize its scheduling decisions by prioritizing the most significant experiences.
While the proposed approaches show promising results, future contributions may extend the models to consider more complex network topologies. We also plan to investigate dynamic traffic patterns with varying packet generation rates and bursty behavior, as well as advanced interference scenarios involving cross-technology interference and multi-channel contention. In addition, we aim to adapt the current RL models by refining the state representation and reward formulation to better capture these complexities. Finally, we plan to validate the proposed methods in real environments through experimental tests on IoT platforms to assess their practical applicability.

Author Contributions

Conceptualization, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; methodology, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; software, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; validation, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; formal analysis, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; investigation, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; resources, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; data curation, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; writing—original draft preparation, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; writing—review and editing, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B.; visualization, R.B.; supervision, R.B.; project administration, S.B.Y. (Sahar Ben Yaala), S.B.Y. (Sirine Ben Yaala) and R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The research data supporting this publication are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sisinni, E.; Saifullah, A.; Han, S.; Jennehag, U.; Gidlund, M. Industrial internet of things: Challenges, opportunities, and directions. IEEE Trans. Ind. Inform. 2018, 14, 4724–4734. [Google Scholar] [CrossRef]
  2. Bi, S.; Zeng, Y.; Zhang, R. Wireless powered communication networks: An overview. IEEE Wirel. Commun. 2016, 23, 10–18. [Google Scholar] [CrossRef]
  3. Yaala, S.B.; Théoleyre, F.; Bouallegue, R. Performance study of co-located IEEE 802.15.4-TSCH networks: Interference and coexistence. In Proceedings of the 2016 IEEE Symposium on Computers and Communication (ISCC), Messina, Italy, 27–30 June 2016; pp. 513–518. [Google Scholar]
  4. Yaala, S.B.; Théoleyre, F.; Bouallegue, R. Cooperative resynchronization to improve the reliability of colocated IEEE 802.15.4-TSCH networks in dense deployments. Ad Hoc Netw. 2017, 64, 112–126. [Google Scholar] [CrossRef]
  5. Vilajosana, X.; Wang, Q.; Chraim, F.; Watteyne, T.; Chang, T.; Pister, K.S. A realistic energy consumption model for TSCH networks. IEEE Sensors J. 2013, 14, 482–489. [Google Scholar] [CrossRef]
  6. Wiering, M.A.; Van Otterlo, M. Reinforcement Learning. Adaptation, Learning, and Optimization; Springer: Berlin/Heidelberg, Germany, 2012; Volume 12, p. 729. [Google Scholar]
  7. Kayhan, B.M.; Yildiz, G. Reinforcement learning applications to machine scheduling problems: A comprehensive literature review. J. Intell. Manuf. 2023, 34, 905–929. [Google Scholar] [CrossRef]
  8. Dietterich, T.G. Machine learning. Annu. Rev. Comput. Sci. 1990, 4, 255–306. [Google Scholar] [CrossRef]
  9. Zhao, Z.; Alzubaidi, L.; Zhang, J.; Duan, Y.; Gu, Y. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Syst. Appl. 2024, 242, 122807. [Google Scholar] [CrossRef]
  10. Naeem, S.; Ali, A.; Anam, S.; Ahmed, M.M. An unsupervised machine learning algorithms: Comprehensive review. Int. J. Comput. Digit. Syst. 2023, 13, 911–921. [Google Scholar] [CrossRef]
  11. Matsuo, Y.; LeCun, Y.; Sahani, M.; Precup, D.; Silver, D.; Sugiyama, M.; Uchibe, E.; Morimoto, J. Deep learning, reinforcement learning, and world models. Neural Netw. 2022, 152, 267–275. [Google Scholar] [CrossRef]
  12. Silvestrini, S.; Lavagna, M. Deep learning and artificial neural networks for spacecraft dynamics, navigation and control. Drones 2022, 6, 270. [Google Scholar] [CrossRef]
  13. Keith, A.J.; Ahner, D.K. A survey of decision making and optimization under uncertainty. Ann. Oper. Res. 2021, 300, 319–353. [Google Scholar] [CrossRef]
  14. Naderializadeh, N.; Sydir, J.J.; Simsek, M.; Nikopour, H. Resource management in wireless networks via multi-agent deep reinforcement learning. IEEE Trans. Wirel. Commun. 2021, 20, 3507–3523. [Google Scholar] [CrossRef]
  15. Alwarafy, A.; Abdallah, M.; Ciftler, B.S.; Al-Fuqaha, A.; Hamdi, M. Deep reinforcement learning for radio resource allocation and management in next generation heterogeneous wireless networks: A survey. arXiv 2021, arXiv:2106.00574. [Google Scholar] [CrossRef]
  16. Jaber, M.M.; Ali, M.H.; Abd, S.K.; Jassim, M.M.; Alkhayyat, A.; Jassim, M.; Alkhuwaylidee, A.R.; Nidhal, L. Q-learning based task scheduling and energy-saving MAC protocol for wireless sensor networks. Wirel. Netw. 2024, 30, 4989–5005. [Google Scholar] [CrossRef]
  17. Upadhyay, P.; Marriboina, V.; Goyal, S.J.; Kumar, S.; El-Kenawy, E.S.; Ibrahim, A.; Alhussan, A.A.; Khafaga, D.S. An improved deep reinforcement learning routing technique for collision-free VANET. Sci. Rep. 2023, 13, 21796. [Google Scholar] [CrossRef]
  18. Lu, C.; Wang, Z.; Ding, W.; Li, G.; Liu, S.; Cheng, L. MARVEL: Multi-agent reinforcement learning for VANET delay minimization. China Commun. 2021, 18, 1–11. [Google Scholar] [CrossRef]
  19. Liu, J.C.; Susanto, H.; Huang, C.J.; Tsai, K.L.; Leu, F.Y.; Hong, Z.Q. A Q-learning-based downlink scheduling in 5G systems. Wirel. Netw. 2024, 30, 6951–6972. [Google Scholar] [CrossRef]
  20. Azimi, Y.; Yousefi, S.; Kalbkhani, H.; Kunz, T. Applications of machine learning in resource management for RAN-slicing in 5G and beyond networks: A survey. IEEE Access 2022, 10, 106581–106612. [Google Scholar] [CrossRef]
  21. Tanveer, J.; Haider, A.; Ali, R.; Kim, A. An overview of reinforcement learning algorithms for handover management in 5G ultra-dense small cell networks. Appl. Sci. 2022, 12, 426. [Google Scholar] [CrossRef]
  22. Frikha, M.S.; Gammar, S.M.; Lahmadi, A.; Andrey, L. Reinforcement and deep reinforcement learning for wireless Internet of Things: A survey. Comput. Commun. 2021, 178, 98–113. [Google Scholar] [CrossRef]
  23. Jarwan, A.; Ibnkahla, M. Edge-based federated deep reinforcement learning for IoT traffic management. IEEE Internet Things J. 2022, 10, 3799–3813. [Google Scholar] [CrossRef]
  24. Nguyen, T.G.; Phan, T.V.; Hoang, D.T.; Nguyen, T.N.; So-In, C. Federated deep reinforcement learning for traffic monitoring in SDN-based IoT networks. IEEE Trans. Cogn. Commun. Netw. 2021, 7, 1048–1065. [Google Scholar] [CrossRef]
  25. Zhang, J.; Ye, M.; Guo, Z.; Yen, C.Y.; Chao, H.J. CFR-RL: Traffic engineering with reinforcement learning in SDN. IEEE J. Sel. Areas Commun. 2020, 38, 2249–2259. [Google Scholar] [CrossRef]
  26. Hussain, F.; Anpalagan, A.; Khwaja, A.S.; Naeem, M. Resource allocation and congestion control in clustered M2M communication using Q-learning. Trans. Emerg. Telecommun. Technol. 2017, 28, e3039. [Google Scholar]
  27. Huang, Y. Deep Q-networks. In Deep Reinforcement Learning: Fundamentals, Research and Applications; Springer: Singapore, 2020; pp. 135–160. [Google Scholar]
  28. Yao, G.; Zhang, N.; Duan, Z.; Tian, C. Improved SARSA and DQN algorithms for reinforcement learning. Theor. Comput. Sci. 2025, 1027, 115025. [Google Scholar] [CrossRef]
  29. Oakes, B.; Richards, D.; Barr, J.; Ralph, J. Double deep Q networks for sensor management in space situational awareness. In Proceedings of the 2022 25th International Conference on Information Fusion (FUSION), Linköping, Sweden, 4–7 July 2022; pp. 1–6. [Google Scholar]
  30. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  31. Zhou, W.; Li, Y.; Yang, Y.; Wang, H.; Hospedales, T. Online meta-critic learning for off-policy actor-critic methods. Adv. Neural Inf. Process. Syst. 2020, 33, 17662–17673. [Google Scholar]
  32. Canese, L.; Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-agent reinforcement learning: A review of challenges and applications. Appl. Sci. 2021, 11, 4948. [Google Scholar] [CrossRef]
  33. Pateria, S.; Subagdja, B.; Tan, A.H.; Quek, C. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  34. Clifton, J.; Laber, E. Q-learning: Theory and applications. Annu. Rev. Stat. Its Appl. 2020, 7, 279–301. [Google Scholar] [CrossRef]
  35. Li, Y.; Yu, C.; Shahidehpour, M.; Yang, T.; Zeng, Z.; Chai, T. Deep reinforcement learning for smart grid operations: Algorithms, applications, and prospects. Proc. IEEE 2023, 111, 1055–1096. [Google Scholar] [CrossRef]
  36. Zhao, D.; Wang, H.; Shao, K.; Zhu, Y. Deep reinforcement learning with experience replay based on SARSA. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; pp. 1–6. [Google Scholar]
  37. Wu, J.; Zhang, G.; Nie, J.; Peng, Y.; Zhang, Y. Deep reinforcement learning for scheduling in an edge computing-based industrial internet of things. Wirel. Commun. Mob. Comput. 2021, 2021, 8017334. [Google Scholar] [CrossRef]
  38. Zhang, M.; Chen, Y.; Susilo, W. PPO-CPQ: A privacy-preserving optimization of clinical pathway query for e-healthcare systems. IEEE Internet Things J. 2020, 7, 10660–10672. [Google Scholar] [CrossRef]
  39. Abbas, K.; Cho, Y.; Nauman, A.; Khan, P.W.; Khan, T.A.; Kondepu, K. Convergence of AI and MEC for autonomous IoT service provisioning and assurance in B5G. IEEE Open J. Commun. Soc. 2023, 4, 2913–2929. [Google Scholar] [CrossRef]
  40. Deng, J.; Sierla, S.; Sun, J.; Vyatkin, V. Reinforcement learning for industrial process control: A case study in flatness control in steel industry. Comput. Ind. 2022, 143, 103748. [Google Scholar] [CrossRef]
  41. Xu, X.; Jia, Y.; Xu, Y.; Xu, Z.; Chai, S.; Lai, C.S. A multi-agent reinforcement learning-based data-driven method for home energy management. IEEE Trans. Smart Grid 2020, 11, 3201–3211. [Google Scholar] [CrossRef]
  42. Bahrpeyma, F.; Reichelt, D. A review of the applications of multi-agent reinforcement learning in smart factories. Front. Robot. AI 2022, 9, 1027340. [Google Scholar] [CrossRef]
  43. Marini, R.; Park, S.; Simeone, O.; Buratti, C. Continual meta-reinforcement learning for UAV-aided vehicular wireless networks. In Proceedings of the ICC 2023—IEEE International Conference on Communications, Rome, Italy, 28 May–1 June 2023; pp. 5664–5669. [Google Scholar]
  44. Neužil, J. Distributed Signal Processing in Wireless Sensor Networks for Diagnostics. Ph.D. Thesis, Czech Technical University, Prague, Czechia, 2016. [Google Scholar]
