Article

RL-BMAC: An RL-Based MAC Protocol for Performance Optimization in Wireless Sensor Networks

1 Department of Computer and Software Technology, University of Swat, Swat 19120, Pakistan
2 Department of Applied Informatics, Fo Guang University, Yilan 262307, Taiwan
3 Department of Artificial Intelligence, Tamkang University, New Taipei 251301, Taiwan
* Author to whom correspondence should be addressed.
Information 2025, 16(5), 369; https://doi.org/10.3390/info16050369
Submission received: 28 February 2025 / Revised: 21 April 2025 / Accepted: 28 April 2025 / Published: 30 April 2025
(This article belongs to the Special Issue Sensing and Wireless Communications)

Abstract

Applications of wireless sensor networks have increased significantly in the modern era. These networks operate on a limited power supply in the form of batteries, which are normally difficult to replace on a frequent basis. In wireless sensor networks, sensor nodes alternate between sleep and active states to conserve energy through different methods, and duty cycling is among the most commonly used. However, duty cycling suffers from problems such as unnecessary idle listening, extra energy consumption, and a high packet drop rate. A deep reinforcement learning-based B-MAC protocol, called RL-BMAC, is proposed to address these issues. The proposed protocol deploys a deep reinforcement learning agent with fixed hyperparameters to optimize the duty cycling of the nodes. The reinforcement learning agent monitors essential parameters such as energy level, packet drop rate, neighboring nodes’ status, and preamble sampling, stores this information as a representative state, and adjusts the duty cycling of all nodes. The performance of RL-BMAC is compared to that of conventional B-MAC through extensive simulations. The results indicate that RL-BMAC outperforms B-MAC in terms of throughput by 58.5%, packet drop rate by 44.8%, energy efficiency by 35%, and latency by 26.93%.

1. Introduction

The proliferation of wireless communication technologies has resulted in an unparalleled surge in the number of interconnected sensor nodes and the need for effective communication protocols. Over the last decade, there has been a significant increase in the deployment of Wireless Sensor Networks (WSNs) in various application areas such as industrial automation, healthcare, and environmental monitoring [1]. The purpose of these networks is to collect data in real time to provide useful services [2]. The primary challenges faced by WSNs include scalability and coverage, security, mobility management, and energy efficiency. Among these, energy efficiency is a critical issue that has attracted significant research attention. Various techniques, such as energy balancing, hierarchical communication, and duty cycling, have been proposed to enhance sustainability and extend network lifetime. Duty cycling, one of the most common techniques for energy conservation in WSNs, allows sensor nodes to sleep periodically, which reduces idle listening, avoids collisions and retransmissions, and minimizes overhearing [3]. Efficient low-power Medium Access Control (MAC) protocols minimize the energy consumption of nodes by efficiently utilizing network resources. Berkeley MAC (B-MAC) is a widely used low-power MAC protocol and an extended version of the S-MAC protocol [4]. It is an asynchronous, contention-based MAC protocol that utilizes adaptive duty cycling to configure the sleep and wake-up schedule of nodes according to changing network conditions. B-MAC has become popular because of its suitability for energy-constrained nodes in WSNs due to its low-power properties [5]. However, it has several limitations, such as a lack of node reconfiguration under dynamic network conditions, long preambles, extra control packet overhead, and inefficient duty cycling when nodes have no data to send or receive.
Alternative approaches are required to address these limitations under dynamic network conditions. Machine learning algorithms have the potential to address the aforementioned issues by learning from the environment and making decisions based on real-time network conditions. The Deep Q-Network (DQN) is a machine learning algorithm that offers an effective and emerging approach to enhance network performance through autonomous decisions. Because DQN is model-free, such decision-making becomes achievable in dynamic network conditions. It allows the sensor nodes to adapt to changing network conditions in terms of energy levels and duty cycling. These capabilities enable us to devise algorithms for adjusting the sleep and wake-up schedule in WSNs to optimize energy efficiency [6,7]. We propose a deep reinforcement learning-based B-MAC protocol, called RL-BMAC, that exploits DQN to dynamically adjust the sleep and wake-up schedules of nodes based on real-time network conditions. The proposed protocol determines and analyzes performance parameters such as throughput, packet drop rate, and energy consumption to optimize the sleep and wake-up schedules of the nodes while maintaining the reliability of the network. It continuously monitors the network-relevant parameters of B-MAC to reconfigure the scheduling. The performance of the proposed approach is analyzed using extensive simulations, and the results are derived in terms of throughput, packet drop rate, network lifetime, and energy consumption.
The key contributions of this study are as follows:
  • Proposing an optimal approach to enhance the performance of the B-MAC protocol through the integration of intelligent sleep scheduling.
  • Exploiting DQN to dynamically optimize the sleep and wake-up schedule of nodes based on real-time network conditions.
  • Demonstrating the effectiveness of the proposed approach through extensive simulations, showcasing substantial performance improvement while maintaining reliable communication.
The subsequent sections of this work are structured in the following manner: Section 2 presents related work on machine learning-based MAC protocols. Section 3 presents the proposed protocol in detail. Section 4 presents the simulation environment and results. Section 5 concludes our work.

2. Related Work

The literature in this field has witnessed diverse contributions addressing different aspects of energy-efficient protocols and machine learning applications in WSNs. There are several MAC protocols designed to address the issue of energy efficiency in WSNs. Table 1 highlights the related work, including additional protocols related to the integration of RL in WSNs. S-MAC [3] is one of the first MAC protocols for energy efficiency in WSNs. It optimizes energy efficiency by introducing periodic sleep and wake-up schedules for the nodes. It utilizes sleep synchronization to reduce idle listening. It also reduces the overhead by employing message passing. The protocol, however, is not suitable for high-latency networks due to its fixed sleep scheduling. The authors of [8] introduced an updated version of S-MAC by introducing a type of adaptive duty cycling called T-MAC. This was achieved by using an adaptive time interval. In this approach, the nodes go to sleep if no sending or receiving activity is detected during a certain time duration. This resulted in a significant reduction in energy consumption. However, the protocol suffered a premature sleep transition, resulting in an increased packet drop rate. The authors of [9] presented another energy-efficient MAC protocol. They added the feature of preamble sampling. In this protocol, the sensor nodes do not rely on a synchronized sleep schedule but actually utilize preamble sampling to wake up independently. Idle listening and energy consumption are optimized by using an optimal preamble length. The extra overhead of frequently adjusting the schedule and preamble length is a limitation of this protocol. X-MAC is another energy-efficient protocol presented in [10], which optimizes preamble sampling using short wakeup signals along with listening duration, called strobed preambles. This technique helped with latency reduction and energy optimization because the nodes would respond to received data earlier. The balance between energy consumption and reduced latency was significantly achieved through this protocol. LMAC is based on the Time Division Multiple Access (TDMA) protocol, but unlike other TDMA-based protocols, the slot assignment is performed dynamically in LMAC. This helps reduce contention and packet collision. The protocol suffers from synchronization overhead when dynamically assigning slots [11]. Z-MAC is presented in [12], which is a hybrid protocol and incorporates the options of both TDMA and Carrier Sense Multiple Access (CSMA) to adjust different types of network loads. It uses TDMA in normal scenarios but changes to CSMA when there is heavy traffic load. However, while performing these reconfigurations, it suffers from the extra overhead of global synchronization to manage the schedule. The authors in [13] presented the efficiency of the node probabilistic sleep scheduling in epidemic spreading models to restrain epidemic spreading in WSNs. Apart from this, distributed sleep scheduling algorithms have also been tested through realistic simulation, and their results show improved network lifetime and other network parameters like throughput, etc., keeping in mind minimizing the communication overhead. This is achieved by balancing the node sleep schedule and probabilities of their activation to prolong network lifetime [14]. There is some state-of-the-art research on energy-efficient protocols in low-power networks involving machine learning in general and RL specifically. Bebortta et al. 
[15] provided empirical evidence supporting the occurrence of power-law dynamics in cognitive Internet of Things (IoT) applications. The present study evaluated the inherent capacity to comprehend the statistical attributes of IoT data, thereby offering valuable insights into the underlying patterns of communication. Ruan et al. [16] investigated energy-conscious methodologies specifically designed for wireless sensor nodes powered by energy harvesting. This research was centred on examining approaches to enhance energy efficiency in nodes equipped with the capability to extract energy from the surrounding environment. Mukherjee et al. [17] developed a cloud-based healthcare model with an improved k-nearest neighbor classifier for the early diagnosis of COVID-19. Magsi et al. [18] proposed an innovative adaptive algorithm with consideration for battery life for use in IoT-based healthcare data transfer. The study’s primary goal was to improve data transfer methods to extend the useful life of medical device batteries. Sodhro et al. [19] presented an adaptive approach for controlling transmission power in wireless body area networks to reduce energy consumption. The goal of the research was to improve the efficiency of nodes as a whole through the use of variable transmission power. Sandhu et al. [20] researched and analyzed scheduling methods for IoT devices that run on energy harvested from the environment. This study highlights the importance of task scheduling in energy harvesting. It also highlights different sorts of difficulties faced while task scheduling.
Kosunalp [28] suggested a novel Q-learning-based technique for energy prediction in WSNs that can harvest power. Sharma et al. [29] researched the state of solar energy harvesting for sensor nodes in WSNs. Solar energy collecting technologies and their use in WSNs were the focus of this investigation. Kumar et al. [30] offered a survey on machine learning algorithms for WSNs. This study surveyed numerous machine learning strategies used to improve WSNs’ functionality. Alsheikh et al. [31] analyzed how machine learning approaches can be beneficial, in general, for WSNs. Their work outlined most of the major ML algorithms, techniques, and approaches intended to enhance different performance parameters of sensor networks. Sharma et al. [32] focused on developing smart and flexible sensor networks for urban areas. Their study aimed at integrating WSNs in smart cities. The study in [33] developed ReLeC, a protocol that combines reinforcement learning with clustering techniques to plan and schedule node activities, thereby extending network lifetime and reducing energy consumption. In the realm of resource-constrained environments, an RL-based protocol synthesis framework tailored for ultra-low complexity wireless transceiver is proposed in [34]. The approach leverages multi-armed bandit algorithms to facilitate efficient medium access without relying on complex operations like carrier sensing or network synchronization. Keshtiarast et al. in [35] introduced a multi-agent deep reinforcement learning framework that enables individual network nodes to autonomously learn and optimize their MAC strategies based on local observations, demonstrating superior performance over traditional protocols in various scenarios. Focusing on quality of service (QoS) alongside energy efficiency, Kim et al. in [36] presented a receiver-initiated MAC protocol that utilizes reinforcement learning to adjust the wake-up cycles of receiver nodes based on packet characteristics and residual energy, resulting in increased network lifetime and reduced transmission delays. In the context of wireless multimedia sensor networks (WMSNs), Joshi et al. implemented a SARSA-based reinforcement learning protocol to manage the energy consumption of multimedia data transmission, achieving significant improvements over existing protocols [37]. These studies collectively underscore the potential of reinforcement learning techniques in enhancing the energy efficiency and overall performance of MAC protocols in WSNs.

3. Proposed Protocol

In this section, we describe the procedures and techniques that are used to integrate smart sleep scheduling in the B-MAC protocol using DQN. The step-by-step discussion covers network initialization, the deployment of DQN agents, the formulation of state representations, action selection, and the entire learning process. The following sub-sections present the operation of the B-MAC and the RL-BMAC protocols.

3.1. B-MAC

B-MAC is an energy-efficient and lightweight MAC protocol that utilizes the capabilities of low-power listening. B-MAC performs duty cycling through adaptive preamble sampling, which minimizes the duty cycle by reducing idle listening. This adaptation is based on energy levels. It also calculates energy consumption based on packet length and rate of packet generation [4]. Figure 1 illustrates the basic workings of duty cycling and preamble sampling in the standard B-MAC protocol.
Although standard B-MAC can adjust the duty cycle depending on certain conditions, it still suffers from several limitations. Firstly, it incurs an energy consumption overhead when long preambles are sent with long check intervals. Additionally, while B-MAC has few synchronization and state requirements, its energy optimization is poor at low data rates. Moreover, B-MAC performs complex reconfiguration procedures, which add overhead to its operation and consume more energy [4].
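To make the low-power listening mechanism concrete, the following minimal Python sketch illustrates the preamble-sampling idea described above. It is not part of the B-MAC specification; the check logic and traffic probability are purely illustrative.

```python
import random

def preamble_sampling_check(channel_has_preamble: bool) -> str:
    # B-MAC low-power listening in a nutshell: the radio wakes briefly,
    # samples the channel, and stays awake only if a preamble is detected;
    # otherwise it returns to sleep until the next check interval. A sender
    # must therefore transmit a preamble at least as long as the receiver's
    # check interval to be noticed.
    return "receive" if channel_has_preamble else "sleep"

def simulate_checks(num_checks: int = 10, traffic_probability: float = 0.2) -> list:
    # Toy run: with sparse traffic most checks end in "sleep", which is
    # where B-MAC saves energy relative to always-on listening.
    return [preamble_sampling_check(random.random() < traffic_probability)
            for _ in range(num_checks)]

if __name__ == "__main__":
    print(simulate_checks())
```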

3.2. RL-BMAC

Figure 2 presents an abstract view of our proposed model, while the detailed workings of RL-BMAC are shown in Figure 3 and Algorithm 1. The method comprises several major steps, each with in-depth sub-steps. The major steps are network initialization and Q-learning agent deployment. The sensor network initially uses the standard B-MAC protocol with its initial duty cycle. After some episodes, network parameters such as energy level, packet drop rate, neighbor nodes’ status, and preamble sampling are collected and provided to the Q-network to determine an optimized sleep and wake-up schedule, thus making the schedule dynamic. At the beginning of each episode, episode-specific variables are set to their default values and sleep and wake-up schedules are created for each node. The protocol then iterates over each time step, and nodes in the network make decisions based on the Q-learning agent’s recommendations. A detailed description is provided as follows. Network initialization is the first and most basic step, in which multiple network parameters are specified. These parameters include the number of nodes, channel characteristics, neighboring nodes’ status, and other network conditions. Every node in the network has an initial energy level, which represents the energy available for communication, and an initial queue of data packets waiting to be transmitted.
The overall network initialization can be represented as follows:
$\text{Network Initialization} = \{ E_i, Q_i, C_i, N_i \}$
where:
$E_i$: Initial energy level of the node
$Q_i$: Initial queue of data packets at the node
$C_i$: Initial channel conditions
$N_i$: Number of neighboring nodes
This set includes all the parameters and initial conditions necessary for the subsequent operation of the RL-BMAC.
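As an illustration only, the per-node initialization set above could be represented as in the following Python sketch; the field names and default values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeState:
    energy: float = 1.0                               # E_i: initial energy level
    queue: List[int] = field(default_factory=list)    # Q_i: packets awaiting transmission
    channel: float = 1.0                              # C_i: initial channel condition estimate
    num_neighbors: int = 0                            # N_i: number of neighboring nodes

def initialize_network(num_nodes: int) -> List[NodeState]:
    # Every node starts with full energy, an empty queue (as in Table 2),
    # and nominal channel conditions; a real deployment would populate these
    # values from the simulator or the hardware.
    return [NodeState() for _ in range(num_nodes)]
```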
Algorithm 1: RL-BMAC: Modified MAC Protocol with Deep RL
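Algorithm 1 appears only as an image in the source. The following Python-style skeleton is a hedged reconstruction of the loop described in the text (initialize, observe, act, store, replay); all helper and method names are placeholders rather than the authors’ actual implementation, and the agent interface matches the sketches given later in this section.

```python
def rl_bmac_loop(nodes, agent, num_episodes: int, steps_per_episode: int):
    # Skeleton of the RL-BMAC control loop; `agent` is assumed to expose
    # choose_action / remember / replay methods (see the sketches below).
    for episode in range(num_episodes):
        for node in nodes:
            node.reset_episode()                     # default per-episode variables
            node.create_sleep_wake_schedule()        # initial schedule per node
        for step in range(steps_per_episode):
            for node in nodes:
                # Observe energy level, packet drop rate, neighbor status,
                # and preamble sampling, encoded as a representative state.
                state = node.observe_state()
                action = agent.choose_action(state)  # 0 = transmit, 1 = sleep
                reward, next_state = node.apply(action)
                agent.remember(state, action, reward, next_state)
            agent.replay()                           # minibatch DQN update
```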

Deep Q-Network Agent Deployment in RL-BMAC

We show how the proposed protocol is implemented with the DQN agent. Making smart choices regarding node actions based on observed states is a major responsibility of the DQN agent. This agent is deployed within each sensor node to autonomously manage the sleep and wake-up schedule.
The DQN agent has direct access to local information, including energy levels, preamble sampling results, and neighboring node status. The agent uses this information to make decisions regarding the sleep and wake-up schedule and resource allocation. The chosen DQN hyperparameters, namely the learning rate, epsilon, epsilon decay, epsilon minimum, batch size, and replay memory size, affect learning stability, convergence speed, and overall performance. The learning rate governs how quickly the network weights are updated: high values risk divergence, while low values slow convergence. Epsilon, together with epsilon decay and the epsilon minimum, maintains the exploration–exploitation balance. A high initial epsilon promotes exploration, while the decay factor and minimum value gradually shift the agent toward exploitation as learning proceeds. Poorly chosen epsilon values lead either to sluggish convergence due to insufficient exploration or to inefficient learning caused by excessive exploration time. The batch size determines the stability and variance of gradient updates and can produce better results at the expense of higher computational effort. The replay memory size controls how effectively stored experiences support learning: a small memory overfits to recent entries, whereas a very large one causes learning from outdated, irrelevant experiences. The nodes communicate with each other to share relevant information that influences the decision-making process of the DQN agent. By exchanging information such as packet drop rate, preamble sampling results, and energy level, nodes keep themselves updated about the changing conditions of the network. We apply a single-agent DQN algorithm that employs a neural network function approximator to estimate the action value function $Q(s, a)$, where $s$ denotes the state and $a$ denotes the action. The Q-network converges toward the optimal action value function by minimizing the Temporal Difference (TD) error, i.e., the difference between the predicted and target Q-values during training. The detailed mechanism of the DQN is shown in Figure 3.
The Q-network consists of several fully connected layers that use Rectified Linear Unit (ReLU) activation functions. The input layer receives the state vector $s$, which consists of elements such as energy level, neighbor nodes’ status, packet drop rate, and preamble sampling results. Each fully connected layer transforms the features it receives and passes them to the next layer. At each time step, the agent chooses an action $a$ according to an $\varepsilon$-greedy policy, which combines exploration and exploitation: with probability $1-\varepsilon$ the agent picks the action with the maximum Q-value for the current state $s$, and with probability $\varepsilon$ it selects a random action to allow exploration, $\varepsilon$ being the exploration rate. The core of the presented DQN model is a neural network that processes the state and action spaces of our problem. The network starts with an input layer that processes the feature space of the state, which consists of several dimensions: energy level, queue status, channel conditions, neighbor status, packet drop rate, and preamble sampling. This layer applies no normalization; the raw input features are processed by the subsequent layers. The network has two fully connected hidden layers, each containing 24 neurons. These layers employ the ReLU activation function, $ReLU(x) = \max(0, x)$, which introduces non-linearity and forwards only positive values to the next layer. The choice of two hidden layers with 24 neurons each allows the model to learn the relevant structure without overfitting. The output layer generates a Q-value for each action. It has two neurons and uses a linear activation function corresponding to the two possible actions, i.e., transmit data or sleep. A linear activation is chosen because the outputs must be continuous values representing the expected return of each action. Several hyperparameters govern the training of the DQN model. The learning rate is set to an optimal value that regulates the step size of the weight updates, balancing stability, convergence, and learning efficiency. The discount factor $\gamma$ is set to an optimal value that places more emphasis on near-future rewards than on distant-future rewards. For exploration of the action space, epsilon $\epsilon$ is initially set to 1.0; this high initial value ensures random exploration of actions by the agent. It then decays by an optimal factor per episode until it reaches a minimum value of 0.02, ensuring a gradual shift of the policy from exploration to exploitation. Experiences, each consisting of a state, an action, a reward, and the next state, are drawn from a replay memory with a maximum size of 5000. Randomly sampling these experiences improves and stabilizes the learning process. To balance learning stability and computational efficiency, a batch size of 32 is used. Network weights are adjusted using an optimizer; for RL-BMAC, the Adam optimizer is selected because of its suitable adaptive learning rate properties [38].
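The architecture and hyperparameters described above can be summarized in a short PyTorch sketch; this is an illustrative reconstruction, not the authors’ code. Note that the text lists six state features while Table 2 reports a state size of 4; the sketch assumes the four features from Table 2.

```python
import torch
import torch.nn as nn

STATE_DIM = 4     # energy, neighbor status, packet drop rate, preamble sampling (Table 2)
NUM_ACTIONS = 2   # transmit data or sleep

# Two fully connected hidden layers of 24 ReLU units and a linear output
# layer producing one Q-value per action, as described in the text.
q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 24),
    nn.ReLU(),
    nn.Linear(24, 24),
    nn.ReLU(),
    nn.Linear(24, NUM_ACTIONS),   # linear activation: raw expected returns
)

# Adam optimizer with the learning rate from Table 2 (0.001).
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
```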
As far as training is concerned, it is composed of target Q-value calculation and experience replay. The experiences of the agent are stored in replay memory and sampled to update the network. The network’s learning process is guided by the target value, which is the sum of the received reward and the discounted maximum predicted Q-value of the next state. The Mean Squared Error (MSE) between target Q-values and predicted Q-values is used as the loss function and guides the weight updates during training. The Q-learning update rule is employed to update the Q-network parameters iteratively. Given a transition $(s, a, r, s')$, where $r$ is the reward and $s'$ is the next state, the temporal difference error [23] is computed as follows:
$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$
where $\delta$ represents the temporal difference (TD) error, $r$ is the immediate reward, $\gamma$ is the discount factor, $Q(s, a)$ is the current action value function, and $\max_{a'} Q(s', a')$ denotes the maximum estimated future reward in the next state $s'$ over all possible actions $a'$.
Minimizing the mean-squared TD error serves to update the parameters of the Q-network in the following manner [39]:
$L = \mathbb{E}\left[ \delta^2 \right]$
where $L$ represents the loss function, $\mathbb{E}[\cdot]$ denotes the expectation operator, and $\delta$ is the TD error.
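For illustration, a minimal PyTorch training step consistent with the TD target and MSE loss above might look as follows; the minibatch format and the use of the online network for the target (the paper does not mention a separate target network) are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_network, optimizer, replay_memory, batch_size=32, gamma=0.95):
    # Sample a minibatch of (state, action, reward, next_state) tuples,
    # form the target r + gamma * max_a' Q(s', a'), and minimize the MSE
    # between the target and the predicted Q(s, a).
    if len(replay_memory) < batch_size:
        return None
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)

    q_pred = q_network(states).gather(1, actions).squeeze(1)   # Q(s, a)
    with torch.no_grad():
        q_next = q_network(next_states).max(dim=1).values      # max_a' Q(s', a')
    target = rewards + gamma * q_next                          # TD target
    loss = F.mse_loss(q_pred, target)                          # L = E[delta^2]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```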
This learning process enables the agent to improve its decision-making over time. The DQN agent influences the sleep and wake-up schedule of its respective node based on the learned policy. If the agent determines that a node should enter a sleep state, it communicates this decision to the node’s sleep and wake-up scheduler, which then adjusts the node’s operational state accordingly. In summary, the Q-learning agent is initialized with hyperparameters that govern its learning behavior. These hyperparameters include the learning rate ($\alpha$), the discount factor ($\gamma$), the exploration–exploitation trade-off parameter ($\epsilon$), and others. The initialization is performed as follows [23]:
$\text{Initialize Q-learning Agent:} \quad \text{Agent} \leftarrow \text{DQL}(\alpha, \gamma, \epsilon, \ldots)$
where $\alpha$, $\gamma$, and $\epsilon$ are hyperparameters. The agent observes the current state of each node, which is represented by parameters such as the energy level ($E$), neighbor nodes’ status ($N$), packet drop rate ($P$), and preamble sampling results ($PS$). The RL-BMAC protocol defines its state representation as $S = \text{Encode}(E, N, P, PS)$, where $E$ captures the node’s energy status, $N$ the status of nearby nodes, $P$ the packet drops, and $PS$ the preamble sampling results. These parameters collectively enable the detection of critical network conditions within the environment. Including the energy level enables the agent to make informed decisions that help extend the operational sustainability of the network. By considering neighbor status information, the agent can understand the local network structure and possible interference, making more informed decisions that reduce collisions and improve coordination. The packet drop rate reflects network reliability and congestion, guiding the agent toward transmission scheduling policies of maximum efficiency. The outcome of preamble sampling reveals whether the channel is active or idle, which helps the agent detect unutilized communication time and manage energy more effectively. With these parameters, the DQN agent becomes capable of developing optimal sleep and wake-up adjustments that adapt to varying network environments, sustaining energy efficiency along with reliable data transmission. The encoding function encapsulates relevant information about the node’s environment. In its action space, the agent allows the nodes to transmit data or go into sleep mode. The learned policy and the current state are used to determine the next action, represented as
$\text{Choose an Action:} \quad \text{Action} = \text{Agent.ChooseAction}(S)$
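A possible sketch of $\text{Encode}(E, N, P, PS)$ and the $\varepsilon$-greedy ChooseAction step is given below; the exact encoding and feature scaling are not specified in the paper and are therefore assumptions.

```python
import random
import torch

def encode_state(energy, neighbor_status, drop_rate, preamble_busy):
    # Hypothetical Encode(E, N, P, PS): pack the observations into a flat
    # feature vector for the Q-network; ordering and scaling are illustrative.
    return [energy, neighbor_status, drop_rate, float(preamble_busy)]

def choose_action(q_network, state, epsilon):
    # Epsilon-greedy: with probability epsilon explore a random action,
    # otherwise exploit the learned Q-values (0 = transmit, 1 = sleep).
    if random.random() < epsilon:
        return random.randrange(2)
    with torch.no_grad():
        q_values = q_network(torch.tensor([state], dtype=torch.float32))
    return int(q_values.argmax(dim=1).item())
```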
The agent’s decision-making is influenced by the learned Q-values associated with different actions. After an action is taken, the protocol monitors the percentage of dropped packets to determine the agent’s reward. The goal of the reward system is to reduce packet loss and increase energy efficiency. Each node keeps a record of its present state at each time step within an episode. Critical metrics like energy level, packet drop rate, preamble sampling, and neighboring node status are included in this observation. DQN agents make decisions based on this information, which is encoded into a representative state. The DQN agent allows nodes to choose from a set of predetermined actions, such as sending data or going to sleep. The selected action is then carried out, which results in changing the state of the remaining energy and queue status. The knowledge base and decision-making capability of DQN are regularly updated. This is because DQN is capable of learning from previous experiences. It modifies its Q-values based on what it observes in terms of actions, states, and rewards. This enables the nodes to adjust their parameters according to dynamic network conditions. The integration of a DQN agent within the RL-BMAC protocol introduces a non-negligible computational overhead, which is particularly relevant in the context of resource-constrained wireless sensor nodes. The overhead primarily arises from maintaining and updating the deep neural network, performing forward passes to select actions, and executing backpropagation during the learning phase. These operations require processing power and memory resources that are typically limited on sensor nodes. Though the protocol suffers extra computational overhead, the performance achieved in terms of energy efficiency, reduced packet drop rate, and optimized sleep-wake scheduling significantly outweighs the associated costs. By dynamically adjusting the duty cycle based on context-aware inputs like energy level, neighbor status, and channel conditions, the protocol reduces long-term energy consumption even if per-decision computation is higher.
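The paper states only that the reward penalizes packet loss and encourages energy efficiency; a hypothetical reward shaping and the epsilon decay schedule from Table 2 are sketched below, with the weights being arbitrary placeholders.

```python
def compute_reward(drop_rate, energy_used, w_drop=1.0, w_energy=0.5):
    # Illustrative only: penalize the observed packet drop rate and the energy
    # spent in the last step; the paper does not give an explicit formula.
    return -(w_drop * drop_rate + w_energy * energy_used)

def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    # Gradual shift from exploration to exploitation (values from Table 2).
    return max(epsilon_min, epsilon * decay)
```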

4. Simulation Results and Discussion

This section presents the simulation setup and the performance of the RL-BMAC and standard B-MAC protocols. The simulations are conducted in MATLAB R2024a, a state-of-the-art tool widely used for the performance analysis of communication protocols. Simulations are conducted in a network environment with a fixed number of nodes, and the key parameters used by RL-BMAC and B-MAC are varied for performance analysis. The results are presented in terms of throughput, packet drop rate, network lifetime, energy consumption, and latency. Table 2 shows the basic parameters used in the simulations.

4.1. Throughput with Respect to Number of Nodes

This simulation shows the throughput of RL-BMAC and B-MAC in terms of bits successfully transferred throughout each simulation for different numbers of sensor nodes and packet sizes. Figure 4 presents the throughput of RL-BMAC and B-MAC as a function of the number of nodes for varying packet sizes, i.e., 256 bits, 512 bits, 1024 bits, and 2048 bits. For a packet size of 256 bits and different numbers of nodes, the average throughput improvement of RL-BMAC compared to that of B-MAC is approximately 56%, as shown in Figure 4a. Similarly, the improvement in the throughput of RL-BMAC for packet sizes of 512 bits, 1024 bits, and 2048 bits is 61%, 53%, and 64%, respectively, as shown in Figure 4b–d. The results clearly show that the proposed protocol outperforms B-MAC by approximately 58.5% on average. Multiple factors contribute to this improvement, including adaptive slot selection, dynamic learning, and efficient packet handling.

4.2. Packet Drop Rate with Respect to Number of Nodes

The packet drop rate is evaluated for varying numbers of nodes and packet sizes of 256 bits, 512 bits, 1024 bits, and 2048 bits, as shown in Figure 5. Analysis of packet drop rate is important for checking the reliability and efficiency of data transmission. For a packet size of 256 bits across different numbers of nodes, the average improvement of RL-BMAC over B-MAC is approximately 62%, as shown in Figure 5a. Similarly, for a packet size of 512 bits, 1024 bits and 2048 bits, the improvement in the packet drop rate of RL-BMAC is 33%, 50% and 30%, respectively, as illustrated in Figure 5b–d. The overall average improvement of RL-BMAC over B-MAC is derived from the difference between each packet drop rate at different nodes for both protocols, which is approximately 44.8%. This improvement is the result of a dynamic sleep and wake-up schedule and optimal preamble sampling. The RL-BMAC protocol always has a lower packet drop rate compared to that of B-MAC, thereby implying that it is capable of efficient utilization of resources in the network.

4.3. Network Lifetime with Respect to Number of Nodes

This simulation evaluates the network lifetime of RL-BMAC and B-MAC across varying numbers of nodes. The network lifetime is presented in time steps as a function of the number of nodes. The performance comparison shows the network lifetime values for RL-BMAC and B-MAC at five different node counts. The approximate percentage improvements at different node numbers are as follows: for 10 nodes, the improvement is 6.25%. For 20 nodes, RL-BMAC shows an improvement of 14.29%. For 30 nodes, the improvement is 15.38%. Similarly, for nodes 40 and 50, the average improvement is 21.43% and 26.09%, respectively. The average improvement across all evaluated scenarios is 16.69%, indicating a substantial enhancement in network lifetime when RL-BMAC is employed. The observed improvement in network lifetime can be attributed to the adaptive learning-based optimization and reduction in collisions. The results are presented in Figure 6.

4.4. Energy Consumption with Respect to Number of Nodes

This simulation investigates energy consumption for different numbers of nodes. As shown in Figure 7, the energy consumption of both RL-BMAC and B-MAC increases with the number of nodes; however, RL-BMAC consumes less energy than B-MAC across the node configurations. The energy consumption of B-MAC increases by approximately 2 millijoules (mJ) per node, and the average energy consumed for 50 nodes is around 95 mJ. In contrast, RL-BMAC shows a more gradual increase in energy consumption, at a rate of 1.2 mJ per node, with an average energy consumption of 74 mJ for 50 nodes. The average improvement of RL-BMAC over B-MAC in terms of energy efficiency is approximately 35% for the whole network. Because nodes using RL-BMAC adjust their sleep and wake-up schedules based on network conditions, they do not spend extra time in idle listening.

4.5. Packet Drop Rate with Respect to Packet Arrival Rate

This simulation presents the packet drop rate of RL-BMAC and B-MAC for different packet arrival rates represented in packets per second (pps). In this analysis, the number of nodes is fixed to 40. As shown in Figure 8, the packet drop rate of RL-BMAC at every packet arrival rate is significantly lower than that of B-MAC for different packet arrival rates. This is because, unlike B-MAC, the RL-BMAC optimizes resource allocation based on the network conditions. At 100 pps, RL-BMAC has a packet drop rate of 0.16, while that of B-MAC is 0.42. The overall improvement of RL-BMAC for all traffic loads is approximately 65%. These results suggest that RL-BMAC can minimize the packet drop rate at different traffic load conditions.

4.6. Latency with Respect to Number of Nodes for Varying Number of Hop Counts

This simulation presents the latency of RL-BMAC and B-MAC across varying hop counts and different numbers of nodes, as shown in Figure 9. As shown in Figure 9a, for hop count 1, the average latency reduction of RL-BMAC is 25.65% compared to that of B-MAC. For hop Count 2, it is 24.62%, as shown in Figure 9b. For hop counts 3 and 4, the average latency reduction of RL-BMAC is 30.40% and 27.07%, respectively, as shown in Figure 9c,d. The average latency improvement of RL-BMAC across all hop counts is 26.93%, which demonstrates the effectiveness of the RL-BMAC over the B-MAC protocol in terms of latency for delay-sensitive applications.
Table 3 presents the average performance improvement of RL-BMAC over the B-MAC protocol.

5. Conclusions

In this study, we developed RL-BMAC as a deep reinforcement learning-based protocol improvement of B-MAC, which optimizes duty cycling in wireless sensor networks. An intelligent RL agent built into the protocol can optimize network performance by adjusting node activity according to energy levels, packet drop rate, neighboring node status, and preamble sampling. Future work can explore real-world deployment scenarios and the integration of DRL with other low-power MAC protocols for performance improvement. Additionally, addressing dynamic interference and cross-layer optimizations could further improve network adaptability.

Author Contributions

Conceptualization, O.K. and S.U.; Methodology, O.K., S.U. and H.-C.C.; Validation, O.K.; Investigation, M.K.; Resources, H.-C.C.; Writing-original draft, O.K.; Writing-review & editing, S.U.; Visualization, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yadav, A.K.; Kumar, A. The Smart Analysis of Prolong Lifetime in Industrial Wireless Sensor Networks. In Proceedings of the 2023 International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballar, India, 29–30 April 2023; pp. 1–6. [Google Scholar] [CrossRef]
  2. Adu-Manu, K.S.; Engmann, F.; Sarfo-Kantanka, G.; Baiden, G.E.; Dulemordzi, B.A. WSN Protocols and Security Challenges for Environmental Monitoring Applications: A Survey. J. Sens. 2022, 2022, 1628537. [Google Scholar] [CrossRef]
  3. Ye, W.; Heidemann, J.; Estrin, D. An Energy-Efficient MAC Protocol for Wireless Sensor Networks. In Proceedings of the Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, New York, NY, USA, 23–27 June 2002. [Google Scholar]
  4. Polastre, J.; Hill, J.; Culler, D. Versatile low power media access for wireless sensor networks. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, Baltimore, MD, USA, 3–5 November 2004; pp. 95–107. [Google Scholar]
  5. Ruan, Z.; Luo, H.; Xu, G. Time-Aware and Energy Efficient Data Collection Mechanism for Duty-Cycled Wireless Sensor Networks. In Proceedings of the Artificial Intelligence and Security, New York, NY, USA, 26–28 July 2019; Sun, X., Pan, Z., Bertino, E., Eds.; Springer: Cham, Switzerland, 2019; pp. 199–210. [Google Scholar]
  6. Lee, H.; Kim, S. Optimizing LoRa Networks with Deep Reinforcement Learning. arXiv 2023, arXiv:2309.08965. [Google Scholar]
  7. Sharma, A.; Gupta, P. Energy and Security Trade-off in Wireless Sensor Networks using DRL. Sensors 2024, 24, 1993. [Google Scholar]
  8. van Dam, T.; Langendoen, K. An Adaptive Energy-Efficient MAC Protocol for Wireless Sensor Networks. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems, Los Angeles, CA, USA, 5–7 November 2003. [Google Scholar]
  9. El-Hoiydi, A.; Decotignie, J.D. WiseMAC: An Ultra Low Power MAC Protocol for Multi-hop Wireless Sensor Networks. In International Symposium on Algorithms and Experiments for Sensor Systems, Wireless Networks and Distributed Robotics; Springer: Berlin/Heidelberg, Germany, 2004. [Google Scholar]
  10. Buettner, M.; Yee, G.V.; Anderson, E.; Han, R. X-MAC: A Short Preamble MAC Protocol for Duty-Cycled Wireless Sensor Networks. In Proceedings of the 4th International Conference on Embedded Networked Sensor Systems, Boulder, CO, USA, 31 October–3 November 2006. [Google Scholar]
  11. van Hoesel, L.; Havinga, P. A Lightweight Medium Access Protocol (LMAC) for Wireless Sensor Networks. In Proceedings of the INSS’04. Society of Instrument and Control Engineers (SICE), Tokyo, Japan, 22–23 June 2004. [Google Scholar]
  12. Rhee, I.; Warrier, A.; Aia, M.; Min, J. Z-MAC: A Hybrid MAC for Wireless Sensor Networks. In Proceedings of the 3rd International Conference on Embedded Networked Sensor Systems, San Diego, CA, USA, 2–4 November 2005. [Google Scholar]
  13. Blok, V. Sleep Scheduling for Enhancing the Lifetime of Three-Dimensional Heterogeneous Wireless Sensor Networks. In Proceedings of the CCF Conference on Computer Supported Cooperative Work and Social Computing, Taiyuan, China, 25–27 November 2022. [Google Scholar] [CrossRef]
  14. Pulipati, B.; Pat, D.U.; Miryala, A.; Sathish, K.; Ravi Kumar, C.V. Energy aware multilevel clustering scheme with Wake-Up Sleep Algorithm in underwater wireless sensor networks. J. Eng. Appl. Sci. 2023, 18, 1547–1553. [Google Scholar] [CrossRef]
  15. Bebortta, S.; Senapati, D.; Rajput, N. Evidence of power-law behavior in cognitive IoT applications. Neural Comput. Appl. 2020, 32, 16043–16055. [Google Scholar] [CrossRef]
  16. Ruan, T.; Chew, Z.J.; Zhu, M. Energy-aware approaches for energy harvesting powered wireless sensor nodes. IEEE Sens. J. 2017, 17, 2165–2173. [Google Scholar] [CrossRef]
  17. Mukherjee, R.; Kundu, A.; Mukherjee, I.; Gupta, D.; Tiwari, P.; Khanna, A.; Shorfuzzaman, M. IoT-cloud based healthcare model for COVID-19 detection: An enhanced k-nearest neighbour classifier based approach. Computing 2021, 105, 849–869. [Google Scholar] [CrossRef]
  18. Magsi, H.; Sodhro, A.H.; Al-Rakhami, M.S.; Zahid, N.; Pirbhulal, S.; Wang, L. A novel adaptive battery-aware algorithm for data transmission in IoT-based healthcare applications. Electronics 2021, 10, 367. [Google Scholar] [CrossRef]
  19. Sodhro, A.H.; Li, Y.; Shah, M.A. Energy-efficient adaptive transmission power control for wireless body area networks. IET Commun. 2016, 10, 81–90. [Google Scholar] [CrossRef]
  20. Sandhu, M.M.; Khalifa, S.; Jurdak, R.; Portmann, M. Task scheduling for energy-harvesting-based IoT: A survey and critical analysis. IEEE Internet Things J. 2021, 8, 13825–13848. [Google Scholar] [CrossRef]
  21. Shah, R.C.; Rabaey, J.M. Energy aware routing for low energy ad hoc sensor networks. In Proceedings of the IEEE Wireless Communications and Networking Conference, Orlando, FL, USA, 17–21 March 2002; IEEE: Piscataway, NJ, USA, 2002; Volume 1, pp. 350–355. [Google Scholar]
  22. Rodoplu, V.; Meng, T.H. Minimum energy mobile wireless networks. IEEE J. Sel. Areas Commun. 1999, 17, 1333–1344. [Google Scholar] [CrossRef]
  23. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  24. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  25. Konda, V.R.; Tsitsiklis, J.N. Actor-critic algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 1 January 2000; pp. 1008–1014. [Google Scholar]
  26. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Technical Report, Technical Report CUED/F-INFENG/TR 166; Cambridge University Engineering Department: Cambridge, UK, 1994. [Google Scholar]
  27. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  28. Kosunalp, S. A new energy prediction algorithm for energy-harvesting wireless sensor networks with Q-learning. IEEE Access 2016, 4, 5755–5763. [Google Scholar] [CrossRef]
  29. Sharma, A.; Kakkar, A. Forecasting daily global solar irradiance generation using machine learning. Renew. Sustain. Energy Rev. 2018, 82, 2254–2269. [Google Scholar] [CrossRef]
  30. Kumar, D.P.; Tarachand, A.; Annavarapu, C.S.R. Machine learning algorithms for wireless sensor networks: A survey. Inf. Fusion 2019, 49, 1–25. [Google Scholar] [CrossRef]
  31. Alsheikh, M.A.; Lin, S.; Niyato, D.; Tan, H.P. Machine learning in wireless sensor networks: Algorithms, strategies, and applications. IEEE Commun. Surv. Tutor. 2014, 16, 1996–2018. [Google Scholar] [CrossRef]
  32. Sharma, H.; Haque, A.; Blaabjerg, F. Machine learning in wireless sensor networks for smart cities: A survey. Electronics 2021, 10, 1012. [Google Scholar] [CrossRef]
  33. Sharma, T.; Balyan, A.; Nair, R.; Jain, P.; Arora, S.; Ahmadi, F. ReLeC: A Reinforcement Learning-Based Clustering-Enhanced Protocol for Efficient Energy Optimization in Wireless Sensor Networks. Wirel. Commun. Mob. Comput. 2022, 2022, 3337831. [Google Scholar] [CrossRef]
  34. Dutta, H.; Bhuyan, A.K.; Biswas, S. Reinforcement Learning for Protocol Synthesis in Resource-Constrained Wireless Sensor and IoT Networks. arXiv 2023, arXiv:2302.05300. [Google Scholar]
  35. Keshtiarast, N.; Renaldi, O.; Petrova, M. Wireless MAC Protocol Synthesis and Optimization with Multi-Agent Distributed Reinforcement Learning. arXiv 2024, arXiv:2408.05884. [Google Scholar] [CrossRef]
  36. Kim, H.Y.; Kim, S.C. A Reinforcement Learning-Based Energy Efficient and QoS Supporting MAC Protocol for Wireless Sensor Networks. J. Southwest Jiaotong Univ. 2023, 58. [Google Scholar] [CrossRef]
  37. Joshi, U.; Kumar, R. Reinforcement Learning Based Energy Efficient Protocol for Wireless Multimedia Sensor Networks. Multimed. Tools Appl. 2022, 81, 2827–2840. [Google Scholar] [CrossRef]
  38. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  39. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Figure 1. Duty cycle and preamble sampling in standard B-MAC [4].
Figure 2. Abstract view of DQN.
Figure 3. RL-BMAC architecture.
Figure 4. Throughput vs. number of nodes for various packet sizes.
Figure 5. Packet drop rate vs. number of nodes for various packet sizes.
Figure 6. Comparison of network lifetime for B-MAC and RL-BMAC.
Figure 7. Comparison of energy consumption for B-MAC and RL-BMAC.
Figure 8. Packet drop rate vs. packet arrival rate.
Figure 9. Latency with respect to number of nodes for a varying number of hop counts.
Table 1. Survey of low-power MAC protocols for wireless sensor networks.

| Protocol Used | Problem Addressed | Methodology | Limitations |
| --- | --- | --- | --- |
| TMAC Protocol [8] | Energy-efficient communication | Adaptive duty cycling through an adaptive time interval | Performance degradation in low-traffic scenarios due to early sleeping; can miss incoming transmissions. |
| S-MAC [3] | Energy conservation in sensor networks | Duty cycling for sleep synchronization to minimize idle listening | Fixed sleep schedules lead to increased latency and reduced throughput, especially under variable traffic loads. |
| EAR [21] | Energy-aware routing in ad hoc networks | Routing protocol with energy-based metrics for route selection | Inability to react to rapid topology changes due to static route assumptions; leads to sub-optimal path selection. |
| MECN [22] | Minimum energy communication in mobile wireless networks | Sub-network construction using localized algorithms | High control overhead and computational complexity; unsuitable for highly dynamic or dense topologies. |
| Q-learning [23] | Optimal energy routing in dynamic networks | Model-free RL approach for learning optimal routing policies | Slow convergence in large state-action spaces; performance deteriorates without sufficient exploration control. |
| DQN [24] | Energy optimization under dynamic network conditions | DQN using neural networks for Q-value approximation | High resource consumption during training; requires a large replay buffer and is impractical for deployment on constrained nodes without simplification. |
| Actor–Critic [25] | Dynamic energy management | Model-free RL with separate value and policy networks | Susceptible to instability due to non-stationary targets; requires careful synchronization between actor and critic networks. |
| SARSA [26] | Energy-efficient routing in wireless networks | On-policy RL for learning state-action values directly | Slower convergence compared to Q-learning. |
| Double DQN [27] | Mitigating overestimation in Q-learning | Improved DQN with separate networks to reduce overestimation bias | Tends to converge more slowly than off-policy methods; sensitive to exploration policy and may underperform in sparse reward settings. |
Table 2. Simulation Parameters.

| Parameter | Value |
| --- | --- |
| Queue Status | Initial queue empty |
| State Size | 4 (Energy, Neighbors, Packet drop rate, Preamble sampling) |
| Action Size | 2 ({Transmit data, Sleep}) |
| Learning Rate | 0.001 |
| Discount Factor | 0.95 |
| Epsilon | 1.0 (Initial) |
| Epsilon Decay | 0.995 |
| Epsilon Minimum | 0.01 |
| Batch Size | 32 |
| Maximum Memory Size | 5000 (Replay Memory) |
Table 3. Summary of experiments and results.

| S. No. | Experiment | Average Performance Improvement of RL-BMAC |
| --- | --- | --- |
| 1 | Throughput vs. Number of Nodes | 58.5% |
| 2 | Packet Drop Rate with respect to Number of Nodes | 44.8% |
| 3 | Network Lifetime with respect to Number of Nodes | 16.69% |
| 4 | Energy Consumption with respect to Number of Nodes | 35% |
| 5 | Packet Drop Rate with respect to Packet Arrival Rates | 65% |
| 6 | Latency with respect to Number of Hops | 26.93% |
