1. Introduction
The Internet of Things (IoT) has revolutionized modern healthcare by enabling continuous monitoring, precise diagnostics, and real-time interventions through interconnected sensors and devices, as shown in
Figure 1. Within this paradigm, Healthcare Internet of Things (H-IoT) systems encompass diverse medical applications, ranging from vital signs monitoring and glucose tracking to emergency alerts and remote consultations, all of which demand ultra-reliable low-latency communication (URLLC) [
1].
The massive deployment of H-IoT devices in clinical environments, particularly in urban and hospital settings, has intensified spectrum scarcity, leading to contention among devices and degradation in Quality of Service (QoS), which jeopardizes the delivery of life-critical data. Traditional spectrum management schemes, often centralized and statically configured, struggle to meet the dynamic and priority-sensitive demands of medical IoT devices. They typically fail to adapt in real time to fluctuating traffic loads, heterogeneous service priorities, and rapid variations in wireless channel conditions. Moreover, the computational and signaling overheads of centralized methods make them unsuitable for large-scale, delay-sensitive healthcare scenarios [
2,
3].
To address these limitations, reinforcement learning (RL) has emerged as a robust paradigm for adaptive spectrum management. RL enables devices to learn efficient resource allocation policies by interacting with the environment and autonomously adapting to varying network states. Unlike conventional methods, RL algorithms are inherently suitable for multi-objective optimization, balancing competing objectives such as throughput maximization, latency minimization, energy efficiency, and fairness [
4].
This work introduces a scalable multi-agent reinforcement learning (MARL) framework for priority-aware spectrum management (PASM) in H-IoT systems operating under URLLC constraints. Building on the PASM model, we evaluate a comprehensive set of learning strategies, including Q-Learning [
5], Double Q-Learning [
6], Deep Q-Network (DQN) [
7], Actor–Critic [
4], Dueling DQN [
8], and Proximal Policy Optimization (PPO) [
9]. The choice of these six schemes is deliberate: tabular methods serve as lightweight baselines with low computational cost, deep value-based methods capture complex non-linear dynamics, and policy-gradient methods enhance stability and scalability. By benchmarking these complementary schemes within a unified framework, MARL-PASM highlights their trade-offs and addresses a key gap in the literature where most studies focus on evaluating a single algorithm in isolation.
Our objective is to design and benchmark scheduling policies that jointly improve throughput, delay, fairness, and energy efficiency under realistic traffic and channel conditions while remaining computationally practical for real-time H-IoT deployments. The key contributions of this work are as follows:
We introduce MARL-PASM, a scalable multi-agent reinforcement learning framework for priority-aware spectrum management in H-IoT systems, explicitly designed to meet URLLC requirements by balancing throughput, delay, energy efficiency, and fairness.
We provide a comprehensive benchmarking study by implementing six representative RL strategies: Q-Learning, Double Q-Learning, DQN, Actor–Critic, Dueling DQN, and PPO within a unified PASM framework. This allows systematic comparison across tabular, deep value-based, and policy-gradient methods.
We integrate fairness-aware reward modeling and dynamic class-based prioritization into the framework, enabling adaptive and priority-sensitive scheduling under heterogeneous traffic and realistic wireless channel conditions.
We conduct extensive simulations across varying network sizes (3 to 50 devices) and heterogeneous traffic distributions, reporting consolidated results on throughput, delay, energy efficiency, fairness, blocking probability, convergence behavior, and training cost. These results reveal algorithmic trade-offs and identify PPO as the most promising scheme for large-scale H-IoT deployments.
The rest of the manuscript is organized as follows.
Section 2 reviews recent advances in H-IoT spectrum management and RL applications in wireless networks.
Section 3 presents the system model, including network assumptions, state–action representation, and reward design.
Section 4 details the proposed MARL-PASM framework and learning strategies.
Section 5 outlines the simulation setup and parameters.
Section 6 analyzes results, interprets key findings, and provides future research directions. Finally,
Section 7 concludes the paper.
2. Literature Review
H-IoT networks are designed to support real-time monitoring, emergency response, and intelligent diagnostics through the deployment of connected medical sensors and wearable devices. These networks handle mission-critical data streams, ranging from continuous vital sign monitoring to emergency alerts, and are foundational to smart healthcare delivery systems. However, achieving the stringent QoS requirements such as URLLC poses serious challenges due to spectrum scarcity, dynamic traffic patterns, and interference in densely deployed environments [
10,
11].
5G is explicitly designed to address these challenges through support for URLLC, offering air-interface latencies as low as 1 ms and reliability guarantees exceeding 99.999% in 3GPP Release 16 [12]. However, traditional static or semi-static spectrum management schemes fall short in coping with the complexity and dynamicity of H-IoT traffic. Therefore, intelligent and adaptive solutions, particularly those leveraging RL, have gained increasing attention for their ability to autonomously learn optimal access and scheduling policies in real time.
RL has emerged as a prominent paradigm for Dynamic Spectrum Access (DSA), allowing devices and network agents to autonomously learn channel selection through adaptive, feedback-driven decisions. Q-Learning has been employed in previous work to equip devices with the ability to learn channel access within cognitive radio environments [
13,
14]. These model-free RL algorithms improved spectrum utilization by allowing devices to learn interference-aware access patterns over time. Recent advances have also turned to deep reinforcement learning (DRL), leveraging deep neural networks to handle the high-dimensional state and action spaces typical of IoT environments. The authors in [
15,
16] proposed DRL-based decentralized architectures that significantly improved spectrum-sharing efficacy without relying on central control. Such advances are relevant to H-IoT, where distributed healthcare devices must operate autonomously under congested and interference-heavy conditions. MARL techniques have also increasingly been applied in healthcare. The authors in [
17] introduced a bi-level MARL framework using cooperative learning and game theory to enhance fairness and system throughput. Cooperative MARL models such as these have been successful in large cognitive radio networks [
18], validating the scalability of RL in H-IoT settings. Hierarchical and federated RL is developed in [19], which provides a privacy-preserving federated RL scheme for UAV-assisted IoMT systems. Such approaches enable spectrum allocation without exposing sensitive healthcare data, paving the way for secure, scalable H-IoT systems.
Besides reliability and throughput, fairness and energy efficiency are critical metrics in H-IoT networks, especially for battery-powered medical sensors. DRL solutions like [
20,
21] demonstrate that energy-efficient RL policies can double device lifetimes without compromising latency and reliability. These schemes control task offloading and transmission adaptively in real time to realize an optimal energy-delay trade-off. Another essential aspect is fairness. The study in [
17] incorporated Jain’s Fairness Index directly into the reward function of a MARL system to ensure equitable bandwidth allocation among devices. However, most current frameworks remain oblivious to fairness, leading to low-priority medical devices being denied service. Future research must therefore include fairness as a key optimization goal, alongside latency and throughput. Ensuring URLLC, particularly in dynamic H-IoT environments, is arguably the most demanding objective. Actor–Critic and policy-gradient algorithms have been explored to meet probabilistic latency constraints [
22]. The study in [
23] demonstrated the use of RL to enforce dynamic resource slicing for healthcare applications, further highlighting the practicality of RL in meeting stringent QoS requirements in healthcare systems.
Advanced RL models such as Double DQN, Dueling DQN, and Actor–Critic methods have started to gain visibility in the wireless communication domain. These architectures offer advantages such as reduced overestimation bias, better value function approximation, and improved stability. The study in [
24] demonstrated that Dueling DQN outperformed both vanilla and Double DQN in 5G network slicing scenarios. Such architectures are underexplored in healthcare-specific spectrum management. Existing H-IoT works primarily rely on conventional RL algorithms, and there is minimal integration of advanced methods tailored to the healthcare context. This presents new opportunities to adopt and adapt state-of-the-art RL techniques in H-IoT systems.
Other relevant works include the distributed RL-based spectrum allocation framework in [
25], which applies a distributed multi-agent approach to cognitive IoT environments. This method improves scalability and adaptability in dynamic spectrum settings but does not explicitly address medical QoS requirements such as URLLC. Similarly, Ref. [
26] proposed an RL-based routing approach for cognitive radio-enabled IoT communications, focusing on optimal route selection and interference mitigation. While effective in improving network throughput, the scheme overlooks heterogeneous device constraints and priority-based scheduling needed in H-IoT.
As summarized in
Table 1, most prior studies optimize isolated performance metrics and rely on simplified scenarios, limiting their applicability to dense and heterogeneous H-IoT deployments. To bridge these gaps, this study proposes a unified MARL-based PASM framework that jointly evaluates throughput, delay, energy efficiency, fairness, and blocking probability under URLLC demands, offering a realistic and scalable benchmark across six reinforcement learning strategies.
3. System Model
This section outlines the system model for the proposed RL-based spectrum management framework in an H-IoT environment, as shown in
Figure 2. The architecture consists of multiple parallel channels shared among heterogeneous H-IoT devices operating in discrete time slots. The model encapsulates the network dynamics, state–action representation, and reward design to guide RL agents in decision-making.
We consider a time-slotted wireless communication environment with $C$ orthogonal channels and $N$ H-IoT devices. Each channel $c \in \{1, \ldots, C\}$ supports one active transmission per slot. Devices are classified into three distinct priority classes, denoted by the set $\mathcal{K}$, indexed as $k \in \{1, 2, 3\}$, respectively, with EmergencyAlert traffic having the highest priority. At each time slot $t$, one device sends a request for spectrum access. The request includes the device class, identity, and current context. The system observes this request and must decide on an appropriate action governing spectrum allocation. Let the system state at time $t$ be denoted as $s_t$, defined as a tuple:

$s_t = (c_t, k_t, a_{t-1})$,

where $c_t$ is the current channel index selected for access, $k_t$ denotes the priority class of the requesting device, and $a_{t-1}$ is the action taken in the previous time step. The action space $\mathcal{A}$ comprises five discrete actions:

$\mathcal{A} = \{\text{Deny}, \text{Grant}, \text{Preempt}, \text{Coexist}, \text{Handoff}\}$,

where the action “Deny” rejects the spectrum access request; “Grant” approves the request and allocates the channel exclusively; “Preempt” revokes the access of a lower-priority device to grant the current request; the “Coexist” action is modeled as concurrent channel use with reduced SINR, activated when coexistence is permitted in the environment configuration; and “Handoff” migrates the device to a different available channel. The total state space grows as $|\mathcal{S}| = C \times |\mathcal{K}| \times |\mathcal{A}|$.
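To make the discrete state–action representation concrete, the following minimal Python sketch encodes the state tuple and the five PASM actions. The class labels other than EmergencyAlert and the flattened indexing scheme are illustrative assumptions, not the exact encoding used in our simulator.

from dataclasses import dataclass
from enum import IntEnum

class Action(IntEnum):
    # The five discrete actions of the PASM action space
    DENY = 0
    GRANT = 1
    PREEMPT = 2
    COEXIST = 3
    HANDOFF = 4

class PriorityClass(IntEnum):
    # Index 0 is the highest priority (EmergencyAlert); the other two
    # labels are illustrative placeholders for the medium- and low-priority classes.
    EMERGENCY_ALERT = 0
    MEDIUM_PRIORITY = 1
    LOW_PRIORITY = 2

@dataclass(frozen=True)
class State:
    channel: int              # current channel index, 0 <= channel < C
    device_class: PriorityClass
    last_action: Action

    def to_index(self) -> int:
        # Flatten the tuple into a single index; the resulting state space
        # size is C * |K| * |A|, matching the growth noted in the text.
        return (self.channel * len(PriorityClass) + self.device_class) * len(Action) + self.last_action

# Example: state space size for C = 5 channels
NUM_STATES = 5 * len(PriorityClass) * len(Action)   # 5 * 3 * 5 = 75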
Reward Function
The reward function is designed to promote key objectives of H-IoT systems: high throughput, energy efficiency, fairness among device classes, and low latency for critical devices. Let $r_t$ denote the scalar reward at time $t$, computed as

$r_t = w_1 \hat{T}_t + w_2 F_t - w_3 \hat{E}_t + w_4 \mathbb{1}[k_t = k_{\min,t}]$,

where $\hat{T}_t$ is the normalized throughput achieved in the current time step, $\hat{E}_t$ is the normalized energy cost of spectrum access, and $\mathbb{1}[k_t = k_{\min,t}]$ is an indicator function providing a reward bonus if the currently served class is the least served so far, encouraging service diversity. The weights $w_1$, $w_2$, $w_3$, and $w_4$ are parameters that balance the contribution of throughput, fairness, energy cost, and class equity, respectively, and $F_t$ is the instantaneous fairness index, computed using Jain’s index [27]:

$F_t = \dfrac{\left(\sum_{k=1}^{K} x_k\right)^2}{K \sum_{k=1}^{K} x_k^2}$.

Here, $F_t$ captures proportional fairness across device classes, while the indicator term provides an additional incentive to serve the least-attended class, ensuring diversity in scheduling decisions. The quantity $x_k$ is the cumulative payload volume successfully delivered to class $k$ across the episode, normalized by the number of time slots. Access counts and per-slot rewards are accumulated via environment feedback and aggregated at episode end. This reward formulation promotes efficient resource use while ensuring fairness and prioritizing underserved devices. The agent is trained to maximize the expected discounted cumulative reward [28]:

$J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$,

where $\pi$ is the policy mapping states to actions and $\gamma \in [0, 1)$ is the discount factor.
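A minimal Python sketch of this reward computation, using Jain’s index over the per-class delivered payload, is given below. The weight values and the handling of the all-zero case are illustrative assumptions rather than the exact settings used in our experiments.

import numpy as np

def jains_index(x):
    # Jain's fairness index over per-class delivered payload x_k
    x = np.asarray(x, dtype=float)
    if np.all(x == 0):
        return 1.0  # assumption: treat the all-zero case as perfectly fair
    return (x.sum() ** 2) / (len(x) * np.sum(x ** 2))

def step_reward(throughput_norm, energy_norm, payload_per_class, served_class,
                w1=1.0, w2=1.0, w3=1.0, w4=0.5):
    # w1..w4 are illustrative weights for throughput, fairness,
    # energy cost, and class equity (the values are assumptions).
    fairness = jains_index(payload_per_class)
    least_served_bonus = 1.0 if served_class == int(np.argmin(payload_per_class)) else 0.0
    return w1 * throughput_norm + w2 * fairness - w3 * energy_norm + w4 * least_served_bonus

# Example: the medium-priority class (index 1) is served in this slot
r = step_reward(throughput_norm=0.8, energy_norm=0.2,
                payload_per_class=[3.0, 1.0, 2.0], served_class=1)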
4. Proposed MARL-PASM Framework
In this section, we detail the design of the proposed MARL-PASM framework for H-IoT environments. The framework integrates six RL schemes, namely tabular Q-Learning, Double Q-Learning, Actor–Critic, DQN, Dueling DQN, and PPO. These agents learn policies to dynamically allocate spectrum across heterogeneous medical devices with varying priorities. The overall structure of the framework is illustrated in
Figure 3, while the generic training procedure that underpins all schemes is outlined in Algorithm 1.
Algorithm 1 Generic MARL–PASM training loop
1:  Initialize environment with N devices and traffic classes
2:  Initialize learner parameters for the chosen scheme
3:  for episode e = 1 to E do
4:      Reset environment and obtain initial state s
5:      for time step t = 1 to T do
6:          Select action a using exploration policy
7:          Apply a in the environment, observe reward r and next state s'
8:          Store (s, a, r, s') in buffer or trajectory memory
9:          Learner update:
10:             if Scheme is Q-Learning then update with TD rule
11:             else if Scheme is Double Q then update with decoupled select–evaluate
12:             else if Scheme is DQN then sample mini-batch from buffer and take a gradient step on TD loss
13:             else if Scheme is Dueling DQN then update value and advantage heads via TD loss
14:             else if Scheme is Actor–Critic then update policy with advantage estimate and update critic by value loss
15:             else if Scheme is PPO then compute advantages and optimize the clipped surrogate objective with value loss and entropy bonus
16:         s ← s'
17:         if episode terminates then
18:             break
19:         end if
20:     end for
21:     Log episode metrics: throughput in Mbps, delay in ms, energy efficiency in Gbits/J, fairness, blocking probability, utilization, training time
22: end for
23: return trained policy and recorded metrics
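To make the control flow of Algorithm 1 concrete, the following simplified Python sketch expresses the generic training loop against a hypothetical environment/learner interface; the method names env.reset, env.step, learner.act, learner.update, and env.episode_metrics are assumptions for illustration, not the simulator's actual API.

def train(env, learner, num_episodes, max_steps):
    """Generic MARL-PASM training loop (simplified sketch)."""
    metrics_log = []
    for episode in range(num_episodes):
        state = env.reset()
        for t in range(max_steps):
            action = learner.act(state)                               # exploration policy
            next_state, reward, done, info = env.step(action)
            learner.update(state, action, reward, next_state, done)   # scheme-specific rule
            state = next_state
            if done:
                break
        metrics_log.append(env.episode_metrics())                     # throughput, delay, fairness, ...
    return learner, metrics_log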
In the following subsections, each of the six reinforcement learning schemes integrated within MARL-PASM is described in detail. These include tabular Q-Learning, Double Q-Learning, DQN, Dueling DQN, Actor–Critic, and PPO, with emphasis on their learning mechanisms and update rules.
4.1. Tabular Q-Learning and Double Q-Learning
Tabular Q-Learning maintains a value table $Q(s,a)$ representing the expected long-term reward of taking action $a$ in state $s$ and following the current policy thereafter [29]. It updates values using the rule

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$,

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. This approach is simple and interpretable, but scales poorly with large state–action spaces.
To mitigate overestimation bias, Double Q-Learning [30] introduces two independent estimators $Q_A$ and $Q_B$. On each update, one estimator selects the greedy action while the other evaluates it, for example,

$Q_A(s,a) \leftarrow Q_A(s,a) + \alpha \left[ r + \gamma\, Q_B\!\left(s', \arg\max_{a'} Q_A(s',a')\right) - Q_A(s,a) \right]$,

with the roles of $Q_A$ and $Q_B$ swapped on alternate updates. Both methods use the discrete state representation defined in
Section 3. These algorithms serve as baseline models for comparison with neural approaches.
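The two tabular update rules can be sketched in Python as follows; the hyperparameter values and the flattened 75-state, 5-action table size (taken from the illustrative encoding shown earlier) are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Standard TD(0) update toward the greedy bootstrap target
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Decoupled select-evaluate: one table picks the greedy action,
    # the other evaluates it; the roles are swapped at random.
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:
        a_star = int(np.argmax(QB[s_next]))
        QB[s, a] += alpha * (r + gamma * QA[s_next, a_star] - QB[s, a])

# Example tables for the 75-state, 5-action space sketched earlier
Q = np.zeros((75, 5)); QA = np.zeros((75, 5)); QB = np.zeros((75, 5))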
4.2. Actor–Critic Q-Learning
Actor–Critic methods [29] decompose policy learning into two components: the actor $\pi(a \mid s)$ selects actions based on the current policy, and the critic $V(s)$ estimates the value function to guide policy updates. We use a tabular Actor–Critic variant where the critic updates the state-value estimate using

$V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]$.

The actor updates the policy via preference values $H(s,a)$, typically using a softmax policy:

$\pi(a \mid s) = \dfrac{\exp\!\big(H(s,a)\big)}{\sum_{b} \exp\!\big(H(s,b)\big)}$.

The preference is updated as

$H(s,a) \leftarrow H(s,a) + \beta\, \delta$,

where $\delta = r + \gamma V(s') - V(s)$ is the temporal-difference error.
Actor–Critic methods tend to converge faster in dynamic environments by decoupling policy and value updates, which is beneficial for H-IoT systems with mixed device demands.
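A compact Python sketch of this tabular Actor–Critic variant is given below, with illustrative step sizes; it follows the preference-based update described above rather than any particular library implementation.

import numpy as np

def softmax_policy(H, s):
    # Action probabilities from preference values H(s, .)
    prefs = H[s] - np.max(H[s])          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_critic_update(V, H, s, a, r, s_next, alpha=0.1, beta=0.05, gamma=0.95):
    # Critic: TD(0) update of the state value
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    # Actor: move the preference of the taken action along the TD error
    H[s, a] += beta * delta
    return delta

# Example tables for the illustrative 75-state, 5-action space
V = np.zeros(75); H = np.zeros((75, 5))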
4.3. Deep Q-Network (DQN)
To address the scalability limitations of tabular methods, we adopt DQN, which approximates $Q(s,a)$ using a neural network $Q(s,a;\theta)$ [28]. The network is trained to minimize the temporal-difference loss:

$L(\theta) = \mathbb{E}_{(s,a,r,s')}\!\left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right)^{2} \right]$,

where $\theta^{-}$ denotes the target network parameters updated periodically for training stability.
DQN employs experience replay, storing transitions in a buffer and sampling mini-batches for training. Input states are encoded using one-hot encoding for device class, current channel, and last action. This enables the agent to generalize across diverse traffic scenarios.
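The following condensed PyTorch sketch illustrates the TD-loss computation with a target network. The network width, and the assumption that a replay mini-batch is supplied as tensors of one-hot states, integer actions, rewards, next states, and done flags, are illustrative choices rather than our exact configuration.

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, x):
        return self.net(x)

def dqn_td_loss(q_net, target_net, batch, gamma=0.95):
    # batch: (states s, actions a, rewards r, next states s2, done flags) as tensors
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the periodically updated target network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)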
4.4. Dueling Deep Q-Network (Dueling DQN)
Dueling DQN [28] enhances learning by decomposing $Q(s,a)$ into two separate estimators:

$Q(s,a) = V(s) + \left( A(s,a) - \dfrac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \right)$,

where $V(s)$ represents the state value and $A(s,a)$ the advantage of action $a$ in state $s$. This architecture helps the agent identify important states independently of the action, improving learning efficiency and policy robustness.
The dueling architecture employs two neural network branches sharing initial layers, with one estimating $V(s)$ and the other $A(s,a)$. The aggregated output provides the final $Q(s,a)$ values for action selection.
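A minimal PyTorch sketch of the dueling head is shown below, using the common mean-subtracted aggregation; the layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Shared trunk with separate value and advantage heads (illustrative sizes)."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_head = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, x):
        h = self.trunk(x)
        v = self.value_head(h)                   # shape [B, 1]
        adv = self.advantage_head(h)             # shape [B, |A|]
        # Mean-subtracted aggregation keeps V and A identifiable
        return v + adv - adv.mean(dim=1, keepdim=True)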
4.5. Proximal Policy Optimization (PPO)
PPO is a policy-gradient reinforcement learning algorithm designed to achieve stable and reliable policy updates, particularly in complex or high-dimensional environments [
9]. PPO optimizes a clipped surrogate objective that constrains the policy update within a predefined trust region, preventing destructive large-step updates and improving convergence stability. The policy and value functions are both parameterized using deep neural networks, with the actor network producing a probability distribution over actions and the critic network estimating the state-value function.
The clipped objective function used in PPO is given by

$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right]$,

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate at time $t$, and $\epsilon$ is the clipping parameter. This formulation limits the policy update size while encouraging improvement only when it aligns with the advantage estimate.
In this work, PPO is implemented in an on-policy setting with mini-batch stochastic gradient descent updates. We adopt an adaptive Kullback–Leibler (KL) divergence penalty and entropy regularization to balance exploration and exploitation. Hyperparameters such as learning rate, clipping range, discount factor, and update frequency are tuned to ensure a fair comparison with the other MARL schemes. By incorporating PPO into the MARL-PASM framework, we aim to evaluate its potential for improving adaptability, convergence speed, and policy robustness in heterogeneous H-IoT scenarios.
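For reference, a short PyTorch sketch of the clipped surrogate loss and its combination with the value loss and entropy bonus is given below; the clipping range and loss coefficients shown are illustrative defaults, not the tuned values used in our experiments.

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize, while PPO maximizes the surrogate
    return -torch.min(unclipped, clipped).mean()

def ppo_total_loss(policy_loss, value_loss, entropy, value_coef=0.5, entropy_coef=0.01):
    # Combined objective with value loss and entropy bonus (coefficients are illustrative)
    return policy_loss + value_coef * value_loss - entropy_coef * entropy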
5. Experimental Setup
The proposed MARL-PASM framework is evaluated in a discrete-time, event-driven simulator that emulates realistic H-IoT operating conditions with heterogeneous device classes, latency-sensitive traffic, and energy-constrained spectrum access. Each device belongs to one of three clinically inspired classes with distinct transmission power limits, sampling rates, payload sizes, and battery capacities. Class-dependent energy-per-bit costs and mobility-induced throughput penalties directly influence packet arrivals, achievable rates, and energy depletion, allowing MARL agents to learn priority-aware scheduling under realistic constraints. All agents share a spectrum divided into five orthogonal channels, with each episode modeling sequential access, contention resolution, and dynamic decision-making. Simulations were implemented in Python 3.13 and executed on a 64-bit workstation. Hyperparameters such as learning rate, discount factor, and exploration decay were tuned for each scheme within consistent ranges to ensure stability and fairness in comparison.
The state space provided to each RL agent includes not only the device class, channel index, and previous action, but also the instantaneous queue occupancy and recent SINR statistics, which capture short-term backlog conditions and interference variability. This augmentation provides the agents with a more context-aware view of the environment without substantially increasing the state dimensionality.
Table 2 summarizes the physical-layer and network-level parameters. Where applicable, values are adapted from the IEEE TMLCN [
31] study to ensure realism in the assumed channel and transmission models.
The Baseline Scenario uses a small, fixed device population to study reward dynamics and convergence behavior in a controlled setting. The Scalability Sweep increases the network size up to 50 devices, randomly assigned to classes with the stated proportions, to emulate dense hospital or smart clinic deployments with significant spectrum contention. Device heterogeneity in energy budgets, sampling rates, transmission power constraints, and payload sizes follows the class profiles described in Table 2, ensuring that class-specific operational lifetimes and performance trade-offs are faithfully reflected. The 30–40–30 distribution was selected to mirror realistic H-IoT traffic patterns: medium-priority transmissions dominate (40%) owing to continuous monitoring and diagnostic data, while high-priority emergency alerts (30%) and low-priority background updates (30%) occur less frequently. This balance provides a representative workload for evaluating spectrum management policies under heterogeneous demand. All results are averaged over five independent runs with different random seeds to ensure statistical reliability.
Traffic models incorporate both periodic and bursty packet arrivals, with emergency devices capable of generating irregular high-priority bursts, glucose monitors producing low-rate continuous readings, and fitness trackers generating moderate-rate periodic data. These patterns, coupled with class-specific transmit powers, payload sizes, and energy budgets, emulate realistic medical traffic characteristics observed in hospital and home-care environments. Channel conditions include additive white Gaussian noise, log-normal shadowing, and distance-dependent path loss; no inter-channel interference is assumed for orthogonal allocation. Performance is evaluated using average throughput, delay, energy efficiency, Jain’s fairness index, blocking probability, interruption probability, convergence time, and training time, enabling a comprehensive comparison of algorithmic trade-offs under realistic H-IoT conditions. For PPO, we adopt a clipped surrogate objective with a tuned clipping range, learning rate, and discount factor, an update frequency of 4 epochs per batch, and an entropy regularization coefficient, ensuring a balance between convergence speed and policy stability for a fair comparison with the other RL schemes.
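As a rough illustration of this traffic model, the sketch below draws per-slot packet arrivals for the three device types; the class labels, rates, and burst sizes are placeholder assumptions and do not reproduce the parameters in Table 2.

import numpy as np

rng = np.random.default_rng(42)

def arrivals_this_slot(device_class, t):
    """Number of packets generated by one device in slot t (illustrative rates)."""
    if device_class == "emergency":
        # Rare, bursty high-priority alerts: occasional Poisson bursts
        return rng.poisson(3) if rng.random() < 0.02 else 0
    if device_class == "monitor":
        # Low-rate continuous readings (e.g., glucose): one packet every 10 slots
        return 1 if t % 10 == 0 else 0
    # Moderate-rate periodic data (e.g., fitness tracker): one packet every 2 slots
    return 1 if t % 2 == 0 else 0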
6. Results and Discussion
This section presents a comprehensive evaluation of MARL-PASM against six reinforcement learning schemes, focusing on baseline behavior, scalability under varying network sizes, distributional performance, and convergence/complexity trade-offs. The analysis connects observed trends directly to URLLC requirements, highlighting how MARL-PASM addresses throughput, delay, fairness, energy efficiency, and blocking probability in H-IoT networks.
6.1. Learning Dynamics and Baseline Performance
This section analyzes the learning behavior of the six reinforcement learning algorithms using three fundamental measures: the moving average of reward per episode, the convergence of maximum Q-values, and detailed PPO training diagnostics.
Figure 4 presents the moving average of rewards across 1000 training episodes. PPO consistently demonstrates superior reward learning, reaching values close to 95–100 and converging quickly within the first 200 episodes. Actor–Critic also performs strongly, stabilizing around 85–90. Dueling DQN achieves stable returns of approximately 80–85, while Q-Learning, Double Q-Learning, and DQN remain in the range of 70–80, reflecting their relatively slower convergence under complex H-IoT traffic conditions.
The Q-value convergence trends are depicted in
Figure 5. DQN achieves the highest stability, plateauing around 25–27, followed by Actor–Critic, which stabilizes near 22–24. Q-Learning and Double Q-Learning converge to more modest values of 15–16 and 5–6, respectively, showing their conservative policy updates. Interestingly, Dueling DQN, despite performing well in terms of rewards, shows consistently low Q-value estimates (near 0–1), which suggests that its advantage decomposition emphasizes relative action evaluation rather than absolute Q-value scaling. It is important to note that Q-value convergence is not directly applicable to PPO, as it does not rely on explicit Q-value estimation. Instead, PPO stability is assessed via its diagnostics (policy loss, value loss, entropy, and KL divergence), shown in
Figure 6. This explains the addition of a dedicated PPO diagnostic figure alongside reward and Q-value plots.
To further analyze PPO,
Figure 6 provides detailed diagnostics. The policy loss remains tightly bounded around zero, confirming stable clipped surrogate updates. The value loss fluctuates between 2000 and 3000, reflecting the critic’s effort to adapt to diverse traffic and channel conditions. Policy entropy remains between 0.5 and 1.5, indicating sustained exploration during training. Finally, the KL divergence between old and new policies stays mostly below 0.015, ensuring that PPO maintains stability without catastrophic policy shifts. The combination of high reward and stable convergence trends highlights the strong learning potential of PPO and Actor–Critic in spectrum management for H-IoT. DQN also shows promising convergence but with slightly lower returns. Meanwhile, Q-Learning and Double Q-Learning remain suitable for lightweight deployments where simplicity and stability are prioritized, though they underperform in complex dynamic environments. Dueling DQN demonstrated good reward performance but limited Q-value growth, suggesting that its structural advantage is more beneficial for relative action selection than absolute value scaling.
6.2. Scalability and Robustness
To evaluate the framework’s robustness in dense H-IoT scenarios, we simulate scaling behavior from 5 to 50 devices.
Figure 7,
Figure 8,
Figure 9,
Figure 10 and
Figure 11 illustrate the trends in throughput, delay, fairness, blocking probability, and training time across all six reinforcement learning methods.
Figure 7 highlights that Dueling DQN sustains the highest throughput, stabilizing around 36–37 units regardless of scale, followed closely by DQN at 34–35 units. In contrast, Q-Learning, Double Q-Learning, and Actor–Critic all experience sharp throughput declines below five units as device count increases. PPO performs well under light load (≈28 units at five devices) but quickly degrades under congestion, converging near one to two units at high density. Delay trends, shown in
Figure 8, reinforce these findings. PPO demonstrates the lowest and most consistent delay (~2 ms), while Dueling DQN achieves low and stable latency (14–15 ms). Q-Learning, Double Q-Learning, and DQN stabilize in the range of ~16–19 ms. Actor–Critic, however, suffers from the worst delay, saturating near 99 ms, reflecting instability in its value estimation under scale.
Fairness outcomes in
Figure 9 show that Dueling DQN consistently achieves the highest fairness index (~0.65), followed by DQN (~0.62). PPO lags behind significantly, rarely exceeding 0.2, indicating a trade-off between its low blocking probability and equitable resource allocation. Q-Learning and Double Q-Learning exhibit fairness dips at intermediate scales but recover to ~0.25–0.3. Actor–Critic, though initially poor, improves beyond 0.35 at 50 devices, suggesting late adaptivity. Blocking probability results (
Figure 10) demonstrate that PPO achieves the lowest blocking (~0.01), outperforming all other methods. Dueling DQN follows at 0.07–0.08, and DQN at 0.09–0.10, while Q-Learning, Double Q-Learning, and Actor–Critic remain significantly higher around 0.17–0.18.
Training time comparisons (
Figure 11) add another layer of insight. Tabular Q-Learning, Double Q-Learning, and Actor–Critic remain lightweight, with runtimes near 1 s. PPO shows moderate computational overhead (~30–35 s), while DQN requires ~55 s, and Dueling DQN incurs the highest cost (~80–90 s). This highlights the trade-off: PPO is efficient in latency and blocking but less fair, while Dueling DQN demands more computation yet provides the most balanced overall scalability. Taken together, these results suggest that Dueling DQN and PPO offer complementary strengths. Dueling DQN achieves robust fairness, high throughput, and low delay, while PPO minimizes blocking and delay but struggles with fairness under scale. For fairness-critical and high-utilization H-IoT deployments, Dueling DQN is preferable, whereas PPO may be suitable for ultra-reliable, latency-sensitive applications where fairness is less critical.
6.3. Distributional Performance Insights
This section provides a detailed distributional analysis of key performance indicators using cumulative distribution functions (CDFs) and corresponding mean bar plots. The performance of each MARL variant is evaluated with respect to throughput, delay, energy efficiency, fairness, channel utilization, and interruption probability.
6.3.1. Throughput Distribution
Figure 12 and
Figure 13 present the throughput distribution (CDF) and mean throughput across all schemes. The results reveal a more balanced landscape compared to earlier findings. From the CDF in
Figure 12, all reinforcement learning schemes exhibit relatively close distributions, with slight shifts in tail behavior. PPO demonstrates a slightly delayed rise in its CDF curve, indicating that while it achieves higher peak throughput in certain instances, its distribution is less consistent at lower percentiles compared to DQN and Dueling DQN.
The mean throughput comparison in
Figure 13 highlights PPO as the best performer in mean throughput, with Dueling DQN and Actor–Critic following closely behind, confirming their robustness under varying traffic loads. In contrast, Q-Learning, Double Q-Learning, and DQN yield lower mean throughput values, reflecting the limitations of tabular and standard value-based approaches in dynamic environments (the averaged values are consolidated in Table 3). These findings demonstrate that PPO and Dueling DQN sustain higher throughput under realistic load conditions, directly supporting life-critical data transmission in H-IoT.
6.3.2. Delay Distribution
Figure 14 and
Figure 15 illustrate the comparative delay distributions across all six learning schemes. The results clearly indicate that PPO achieves the lowest delays, with a mean delay of approximately 7 ms and its CDF rising steeply, demonstrating that the majority of transmissions complete within very low latency bounds. This highlights PPO’s efficiency in maintaining stability under varying network conditions.
Q-Learning and Double Q-Learning exhibit moderate performance, averaging around 10–12 ms, while Dueling DQN maintains a slightly higher mean of 12 ms. DQN lags further behind with an average delay close to 14 ms, showing less robustness in latency-sensitive scenarios. Actor–Critic, although computationally lightweight, continues to underperform with delays frequently exceeding 80 ms in tail cases, reflecting instability in value estimation and gradient updates.
Importantly, tail performance analysis shows that PPO reduces the 95th-percentile delay by approximately 11.5% compared to the next best scheme. This confirms that PPO not only minimizes average latency but also effectively suppresses extreme delay outliers, which is crucial for URLLC scenarios where millisecond-level guarantees are decisive.
6.3.3. Energy Efficiency Analysis
Figure 16 and
Figure 17 illustrate the distribution and mean values of energy efficiency across all six schemes. PPO achieves the highest mean energy efficiency, outperforming all other algorithms and highlighting its ability to optimize both spectral utilization and power consumption simultaneously. Dueling DQN and DQN follow closely, demonstrating the strength of deep neural approximators in sustaining efficient policies under high-load H-IoT conditions. Tabular Q-Learning and Double Q-Learning exhibit comparatively lower efficiency, reflecting their limited adaptability in dynamic spectrum environments, while Actor–Critic records the lowest performance among the function approximators, consistent with its weaker convergence behavior observed in earlier subsections (averaged values are reported in Table 3).
6.3.4. Fairness Evaluation
Figure 18 and
Figure 19 demonstrate that all six schemes achieve consistently high fairness, with mean values clustered in a narrow range. Unlike earlier observations where Dueling DQN appeared dominant, the distributional results reveal only marginal differences among the algorithms. PPO achieves the highest fairness, closely followed by Dueling DQN and Q-Learning; Double Q-Learning and DQN exhibit nearly identical fairness levels, while Actor–Critic also maintains comparable equity in resource allocation. The convergence of fairness indices across schemes indicates that MARL-based policies can ensure socially fair spectrum access even under dense H-IoT conditions, and that throughput and delay optimization do not come at the expense of fairness, reinforcing the robustness of the proposed PASM framework.
6.3.5. Channel Utilization and Interruption Risk
Figure 20 and
Figure 21 present the trade-off between channel utilization and interruption probability across all six schemes. The results show that PPO achieves the highest mean channel utilization, approaching 0.19 (normalized), with consistently wider distribution bounds. Dueling DQN and Actor–Critic follow closely, maintaining utilization above 0.18, whereas Q-Learning, Double Q-Learning, and DQN show slightly lower averages (0.17–0.18). This indicates that deep policy-based methods allocate resources more aggressively and effectively in congested conditions.
However, higher utilization must be balanced against interruption risk. The CDF of interruption probability highlights that Actor–Critic exhibits the steepest distribution, reaching nearly 90% probability of interruption in extreme cases, reflecting unstable policy convergence. Q-Learning and Double Q-Learning perform moderately, but their interruption probability spreads across wider ranges, indicating less predictability. In contrast, PPO demonstrated a more gradual CDF slope, sustaining lower interruption risks relative to its higher utilization. Dueling DQN also performs robustly, offering a balanced compromise between throughput-driven utilization and stability.
6.4. Convergence vs. Computational Cost
Figure 22 shows that PPO converges in the fewest episodes (around 75) and with the lowest training time (below 80 s), highlighting its efficiency in both learning and computation. In contrast, Actor–Critic converges in nearly 440 episodes and incurs the largest training cost of approximately 250 s, revealing instability in its policy-gradient updates. Among the deep Q-learning variants, Dueling DQN and DQN converge in approximately 340–390 episodes, slightly slower than the tabular methods but more stable once converged. Tabular Q-Learning and Double Q-Learning require around 310–325 episodes, but with negligible training overhead (below 25 s), which makes them lightweight but less robust at scale.
Beyond training convergence, we also assessed inference complexity across schemes. The tabular methods required negligible memory (under 1 kB) and delivered near-instant decision latency, yet they struggled to scale in accuracy and stability. DQN and Dueling DQN used model sizes in the tens of kilobytes and achieved inference times of around 0.02 to 0.03 ms per decision, yielding thousands of decisions per second on commodity hardware. PPO used the largest model but remained highly efficient, with inference latency close to 0.01 ms and the highest decision throughput. Actor–Critic fell between these extremes, with a footprint of roughly 60k parameters and an inference time of 0.04 ms per decision, a larger memory footprint than the tabular methods and slightly higher latency than PPO.
Figure 23 further evaluates the blocking probability. PPO delivered the lowest blocking probability of around 0.02 with limited variance, demonstrating its ability to ensure reliable service continuity. Actor–Critic also achieves a low average blocking near 0.03, but its higher variance suggests inconsistency across traffic scenarios. Dueling DQN and Q-Learning maintain moderate blocking rates around 0.045, while Double Q-Learning and DQN record the highest blocking rates between 0.055 and 0.06, which may hinder performance under heavier loads.
6.5. Holistic Performance Comparison
Table 3 consolidates the performance of all six schemes across throughput, delay, energy efficiency, fairness, blocking probability, convergence behavior, and training time. The results reveal that PPO consistently outperforms its counterparts by delivering the highest throughput of 680 Mbps, the lowest average delay of 9.2 ms, and the strongest energy efficiency of 15 gigabits per joule. It also achieves the lowest blocking probability of 0.02 and the fastest convergence at 75 episodes, balancing stability with computational efficiency. Among the deep Q-learning methods, Dueling DQN emerges as the most competitive alternative. It offers balanced throughput of 640 Mbps, low delay of 10.4 ms, high energy efficiency of 13.4 gigabits per joule, and fairness of 0.823. Although its training time of 190 s is higher than PPO, its robustness makes it a suitable candidate for applications where policy stability is critical. Standard DQN showed slightly weaker delay and blocking performance, though it maintains strong throughput and energy metrics.
The tabular methods, Q-Learning and Double Q, demonstrate lower computational overhead but lag in throughput, energy efficiency, and blocking probability, which highlights their limited scalability in complex H-IoT environments. Actor–Critic performs moderately, with reasonable fairness and blocking values, but its slower convergence of 440 episodes and higher training time of 250 s make it less practical for real-time deployments. These findings underscore PPO as the most promising scheme for large-scale H-IoT spectrum management, with Dueling DQN as a strong deep reinforcement learning alternative. The comparative analysis validates the trade-off between algorithmic complexity and operational performance, providing insights into the selection of appropriate strategies for different deployment scenarios. These results show that evaluating diverse RL schemes in a unified PASM framework reveals complementary strengths that were previously overlooked in single-scheme studies. This integrated perspective constitutes the key novelty of our work and provides actionable insights for selecting spectrum policies in practical H-IoT deployments.
Table 3 presents the averaged numerical results for all evaluated schemes, complementing the graphical analysis and enabling direct quantitative comparison.
6.6. Practical Considerations for Deploying MARL in H-IoT Networks
Although the proposed MARL-PASM framework performs satisfactorily in the simulated scenarios, deploying it in a real-world H-IoT network raises several practical challenges that require careful scrutiny. In practice, estimating global or shaped reward signals within a live network is not straightforward. Centralized orchestration through an edge server can gather observations and compute feedback from statistics such as latency violations, interference levels, or successful data delivery; however, centralized reward computation introduces additional delay and scalability issues. Furthermore, coordination between MARL agents, which is essential for learning stability and convergence, creates non-trivial communication overhead, particularly when bandwidth and latency are limited. To counteract this, real-world systems can employ decentralized training, federated MARL, or low-communication architectures, trading some coordination for efficiency. Partial observability is another significant problem: H-IoT nodes usually operate with incomplete environmental awareness, which can slow convergence and weaken policy robustness. Techniques such as belief-state modeling or recurrent policy learning can provide robustness under such limitations. Training complexity is also a concern; in practice, agents would either be trained offline in digital-twin platforms or fine-tuned in a lightweight manner to remain computationally manageable. Finally, the network stack imposes design constraints. As noted in recent research such as [31], MAC-layer learning platforms must incorporate realistic access delays, collision dynamics, and signaling limitations to remain deployable. These factors guide the future research roadmap, in which scalability, feasibility, and protocol awareness must be co-optimized to make MARL-based spectrum access practically viable for clinical and mission-critical H-IoT applications.
6.7. Future Work
In future extensions of this work, we will move beyond the fixed three-tier device classification used here and adopt context-aware dynamic prioritization. This will allow devices to adjust their priority in real time based on physiological readings, patient location, or temporal urgency, thereby more accurately reflecting real-world medical requirements. We will design lightweight classifiers to enable this dynamic priority assignment while ensuring low computational overhead, which is essential for resource-constrained H-IoT deployments. Another direction we will pursue is the development of hybrid reinforcement learning architectures that merge complementary schemes. For example, we will combine the stable policy updates of PPO with the rapid value-based learning of DQN variants, or integrate Actor–Critic frameworks with double estimators to mitigate bias. These hybrid approaches will be designed to improve adaptability, accelerate convergence, and enhance policy robustness in complex and rapidly changing H-IoT environments. We will also extend the state space to include long-term temporal patterns such as time-of-day usage cycles and patient-specific activity trends, enabling agents to anticipate and adapt to predictable fluctuations in healthcare traffic. Future work will also extend PASM with classical scheduling baselines such as priority schedulers, TDMA, and proportional fair allocation, enabling direct head-to-head benchmarking with RL-based methods.
In addition, we will expand MARL-PASM to handle extreme-event traffic patterns, such as scenarios where many devices simultaneously transmit emergency data or where traffic dynamics shift abruptly. This will allow us to stress-test the framework and assess its scalability under highly dynamic and challenging conditions. We will also incorporate explicit fault tolerance evaluation, covering sudden device disconnections, overlapping transmissions, high packet loss rates, and rapid energy depletion. These experiments will provide a deeper understanding of the system’s resilience and reliability under adverse operating environments. Furthermore, we will integrate privacy-preserving mechanisms, with a focus on Federated Multi-Agent Reinforcement Learning, where agents learn locally and share only model updates. By coupling this with differential privacy, we will ensure sensitive patient data remains protected while maintaining effective collaborative learning, thereby meeting strict medical data protection requirements. Finally, we will re-implement and benchmark MARL-PASM directly against advanced distributed spectrum allocation and RL-based routing models to enable consistent, head-to-head comparisons. This will ensure that our framework is rigorously evaluated against state-of-the-art solutions and highlights its contributions to the broader landscape of spectrum management in healthcare IoT systems.
7. Conclusions
This study presented the MARL-PASM framework for dynamic spectrum allocation in H-IoT environments, addressing the challenges of heterogeneous traffic classes and stringent QoS demands through centralized, intelligent decision-making. By benchmarking six RL strategies, including PPO, we demonstrated that advanced DRL models deliver clear performance advantages. PPO consistently achieved the highest throughput, lowest latency, strongest energy efficiency, and lowest blocking probability, while Dueling DQN emerged as the most balanced deep Q-learning alternative, sustaining robust performance under dense network conditions. Actor–Critic offered fast convergence under lighter traffic loads, and the tabular methods, though computationally efficient, lacked scalability in complex deployments. Scalability and robustness evaluations confirmed the ability of DRL methods to maintain stability and fairness as device density increases, highlighting their potential for real-world H-IoT applications.
Despite these promising outcomes, several limitations remain. The current framework employs fixed device classes and static priority distributions, does not fully account for extreme emergency surges, and has not been rigorously stress-tested for fault tolerance under conditions such as device disconnections, interference spikes, or energy depletion. Privacy-preserving mechanisms such as federated MARL are also yet to be explored, and validation using real-world, trace-driven datasets is still required. Addressing these gaps will involve incorporating dynamic prioritization, hybrid RL architectures, explicit resilience testing, and privacy-preserving learning techniques. Future work will also benchmark MARL-PASM directly against state-of-the-art distributed spectrum allocation and RL-based routing models to ensure a rigorous comparative evaluation. By advancing along these directions, MARL-PASM can evolve into a robust, scalable, and secure spectrum management framework capable of supporting the safety-critical requirements of next-generation H-IoT communication systems.