1. Introduction
The Internet of Things (IoT) has revolutionized modern healthcare by enabling continuous monitoring, precise diagnostics, and real-time interventions through interconnected sensors and devices, as shown in
Figure 1. Within this paradigm, Healthcare Internet of Things (H-IoT) systems encompass diverse medical applications, ranging from vital signs monitoring and glucose tracking to emergency alerts and remote consultations, all of which demand ultra-reliable low-latency communication (URLLC) [
1].
The massive deployment of H-IoT devices in clinical environments, particularly in urban and hospital settings, has intensified spectrum scarcity, leading to contention among devices and degradation in Quality of Service (QoS), which jeopardizes the delivery of life-critical data. Traditional spectrum management schemes, often centralized and statically configured, struggle to meet the dynamic and priority-sensitive demands of medical IoT devices. They typically fail to adapt in real time to fluctuating traffic loads, heterogeneous service priorities, and rapid variations in wireless channel conditions. Moreover, the computational and signaling overheads of centralized methods make them unsuitable for large-scale, delay-sensitive healthcare scenarios [
2,
3].
To address these limitations, reinforcement learning (RL) has emerged as a robust paradigm for adaptive spectrum management. RL enables devices to learn efficient resource allocation policies by interacting with the environment and autonomously adapting to varying network states. Unlike conventional methods, RL algorithms are inherently suitable for multi-objective optimization, balancing competing objectives such as throughput maximization, latency minimization, energy efficiency, and fairness [
4].
This work introduces a scalable multi-agent reinforcement learning (MARL) framework for priority-aware spectrum management (PASM) in H-IoT systems operating under URLLC constraints. Building on the PASM model, we evaluate a comprehensive set of learning strategies, including Q-Learning [
5], Double Q-Learning [
6], Deep Q-Network (DQN) [
7], Actor–Critic [
4], Dueling DQN [
8], and Proximal Policy Optimization (PPO) [
9]. The choice of these six schemes is deliberate: tabular methods serve as lightweight baselines with low computational cost, deep value-based methods capture complex non-linear dynamics, and policy-gradient methods enhance stability and scalability. By benchmarking these complementary schemes within a unified framework, MARL-PASM highlights their trade-offs and addresses a key gap in the literature where most studies focus on evaluating a single algorithm in isolation.
Our objective is to design and benchmark scheduling policies that jointly improve throughput, delay, fairness, and energy efficiency under realistic traffic and channel conditions while remaining computationally practical for real-time H-IoT deployments. The key contributions of this work are as follows:
We introduce MARL-PASM, a scalable multi-agent reinforcement learning framework for priority-aware spectrum management in H-IoT systems, explicitly designed to meet URLLC requirements by balancing throughput, delay, energy efficiency, and fairness.
We provide a comprehensive benchmarking study by implementing six representative RL strategies: Q-Learning, Double Q-Learning, DQN, Actor–Critic, Dueling DQN, and PPO within a unified PASM framework. This allows systematic comparison across tabular, deep value-based, and policy-gradient methods.
We integrate fairness-aware reward modeling and dynamic class-based prioritization into the framework, enabling adaptive and priority-sensitive scheduling under heterogeneous traffic and realistic wireless channel conditions.
We conduct extensive simulations across varying network sizes (3 to 50 devices) and heterogeneous traffic distributions, reporting consolidated results on throughput, delay, energy efficiency, fairness, blocking probability, convergence behavior, and training cost. These results reveal algorithmic trade-offs and identify PPO as the most promising scheme for large-scale H-IoT deployments.
The rest of the manuscript is organized as follows.
Section 2 reviews recent advances in H-IoT spectrum management and RL applications in wireless networks.
Section 3 presents the system model, including network assumptions, state–action representation, and reward design.
Section 4 details the proposed MARL-PASM framework and learning strategies.
Section 5 outlines the simulation setup and parameters.
Section 6 analyzes results, interprets key findings, and provides future research directions. Finally,
Section 7 concludes the paper.
2. Literature Review
H-IoT networks are designed to support real-time monitoring, emergency response, and intelligent diagnostics through the deployment of connected medical sensors and wearable devices. These networks handle mission-critical data streams, ranging from continuous vital sign monitoring to emergency alerts, and are foundational to smart healthcare delivery systems. However, achieving the stringent QoS requirements such as URLLC poses serious challenges due to spectrum scarcity, dynamic traffic patterns, and interference in densely deployed environments [
10,
11].
5G is explicitly designed to address these challenges through support for URLLC, offering air-interface latencies as low as 1 ms and reliability guarantees exceeding 99.999% in 3GPP Release 16 [12]. However, traditional static or semi-static spectrum management schemes fall short in coping with the complexity and dynamicity of H-IoT traffic. Therefore, intelligent and adaptive solutions, particularly those leveraging RL, have gained increasing attention for their ability to autonomously learn optimal access and scheduling policies in real time.
RL has emerged as a prominent paradigm for Dynamic Spectrum Access (DSA), allowing devices and network agents to autonomously learn channel selection through adaptive, feedback-driven decisions. Q-Learning has been employed in previous work to equip devices with the ability to learn channel access within cognitive radio environments [
13,
14]. These model-free RL algorithms improved spectrum utilization by allowing devices to learn interference-aware access patterns over time. Recent advances have also turned to deep reinforcement learning (DRL), leveraging deep neural networks to handle the high-dimensional state and action spaces typical of IoT environments. The authors in [
15,
16] proposed DRL-based decentralized architectures that significantly improved spectrum-sharing efficacy without relying on central control. Such advances are relevant to H-IoT, where distributed healthcare devices must operate autonomously under congested and interference-heavy conditions. MARL techniques have also increasingly been applied in healthcare. The authors in [
17] introduced a bi-level MARL framework using cooperative learning and game theory to enhance fairness and system throughput. Cooperative MARL models such as these have been successful in large cognitive radio networks [
18], validating the scalability of RL in H-IoT settings. Hierarchical and federated RL is developed in [19], which provides a privacy-preserving federated RL scheme for UAV-assisted IoMT systems. Such approaches enable spectrum allocation without exposing sensitive healthcare data, paving the way for secure, scalable H-IoT systems.
Besides reliability and throughput, fairness and energy efficiency are critical metrics in H-IoT networks, especially for battery-powered medical sensors. DRL solutions like [
20,
21] demonstrate that energy-efficient RL policies can double device lifetimes without compromising latency and reliability. These schemes control task offloading and transmission adaptively in real time to realize an optimal energy-delay trade-off. Another essential aspect is fairness. The study in [
17] incorporated Jain’s Fairness Index directly into the reward function of a MARL system to ensure equitable bandwidth allocation among devices. However, most current frameworks remain oblivious to fairness, leading to low-priority medical devices being denied service. Future research must therefore include fairness as a key optimization goal, alongside latency and throughput. Ensuring URLLC, particularly in dynamic H-IoT environments, is arguably the most demanding objective. Actor–Critic and policy-gradient algorithms have been explored to meet probabilistic latency constraints [
22]. The study in [
23] demonstrated the use of RL to enforce dynamic resource slicing for healthcare applications, further highlighting the practicality of RL in meeting stringent QoS requirements in healthcare systems.
Advanced RL models such as Double DQN, Dueling DQN, and Actor–Critic methods have started to gain visibility in the wireless communication domain. These architectures offer advantages such as reduced overestimation bias, better value function approximation, and improved stability. The study in [
24] demonstrated that Dueling DQN outperformed both vanilla and Double DQN in 5G network slicing scenarios. Such architectures are underexplored in healthcare-specific spectrum management. Existing H-IoT works primarily rely on conventional RL algorithms, and there is minimal integration of advanced methods tailored to the healthcare context. This presents new opportunities to adopt and adapt state-of-the-art RL techniques in H-IoT systems.
Other relevant works include the distributed RL-based spectrum allocation framework in [
25], which applies a distributed multi-agent approach to cognitive IoT environments. This method improves scalability and adaptability in dynamic spectrum settings but does not explicitly address medical QoS requirements such as URLLC. Similarly, Ref. [
26] proposed an RL-based routing approach for cognitive radio-enabled IoT communications, focusing on optimal route selection and interference mitigation. While effective in improving network throughput, the scheme overlooks heterogeneous device constraints and priority-based scheduling needed in H-IoT.
As summarized in
Table 1, most prior studies optimize isolated performance metrics and rely on simplified scenarios, limiting their applicability to dense and heterogeneous H-IoT deployments. To bridge these gaps, this study proposes a unified MARL-based PASM framework that jointly evaluates throughput, delay, energy efficiency, fairness, and blocking probability under URLLC demands, offering a realistic and scalable benchmark across six reinforcement learning strategies.
3. System Model
This section outlines the system model for the proposed RL-based spectrum management framework in an H-IoT environment, as shown in
Figure 2. The architecture consists of multiple parallel channels shared among heterogeneous H-IoT devices operating in discrete time slots. The model encapsulates the network dynamics, state–action representation, and reward design to guide RL agents in decision-making.
We consider a time-slotted wireless communication environment with $C$ orthogonal channels and $N$ H-IoT devices. Each channel $c \in \{1, \ldots, C\}$ supports one active transmission per slot. Devices are classified into three distinct priority classes, denoted by the set $\mathcal{K}$, indexed as $k \in \{1, 2, 3\}$, respectively, with EmergencyAlert traffic having the highest priority. At each time slot $t$, one device sends a request for spectrum access. The request includes the device class, identity, and current context. The system observes this request and must decide on an appropriate action governing spectrum allocation. Let the system state at time $t$ be denoted as $s_t$, defined as a tuple:

$s_t = (c_t, k_t, a_{t-1})$,

where $c_t$ is the current channel index selected for access, $k_t$ denotes the priority class of the requesting device, and $a_{t-1}$ is the action taken in the previous time step. The action space $\mathcal{A}$ comprises five discrete actions:

$\mathcal{A} = \{\text{Deny}, \text{Grant}, \text{Preempt}, \text{Coexist}, \text{Handoff}\}$,

where the action “Deny” rejects the spectrum access request; “Grant” approves the request and allocates the channel exclusively; “Preempt” revokes the access of a lower-priority device to grant the current request; the “Coexist” action is modeled as concurrent channel use with reduced SINR, activated when coexistence is permitted in the environment configuration; and “Handoff” migrates the device to a different available channel. The total state space grows as $|\mathcal{S}| = C \times |\mathcal{K}| \times |\mathcal{A}|$.
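To make the discrete state–action representation concrete, the following minimal Python sketch encodes the state tuple and the five PASM actions. The class labels other than EmergencyAlert and the flattened indexing scheme are illustrative assumptions, not the exact encoding used in our simulator.

from dataclasses import dataclass
from enum import IntEnum

class Action(IntEnum):
    # The five discrete actions of the PASM action space
    DENY = 0
    GRANT = 1
    PREEMPT = 2
    COEXIST = 3
    HANDOFF = 4

class PriorityClass(IntEnum):
    # Index 0 is the highest priority (EmergencyAlert); the other two
    # labels are illustrative placeholders for the medium- and low-priority classes.
    EMERGENCY_ALERT = 0
    MEDIUM_PRIORITY = 1
    LOW_PRIORITY = 2

@dataclass(frozen=True)
class State:
    channel: int              # current channel index, 0 <= channel < C
    device_class: PriorityClass
    last_action: Action

    def to_index(self) -> int:
        # Flatten the tuple into a single index; the resulting state space
        # size is C * |K| * |A|, matching the growth noted in the text.
        return (self.channel * len(PriorityClass) + self.device_class) * len(Action) + self.last_action

# Example: state space size for C = 5 channels
NUM_STATES = 5 * len(PriorityClass) * len(Action)   # 5 * 3 * 5 = 75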
Reward Function
The reward function is designed to promote key objectives of H-IoT systems: high throughput, energy efficiency, fairness among device classes, and low latency for critical devices. Let $r_t$ denote the scalar reward at time $t$, computed as

$r_t = w_1 \hat{T}_t + w_2 F_t - w_3 \hat{E}_t + w_4 \mathbb{1}[k_t = k_{\min,t}]$,

where $\hat{T}_t$ is the normalized throughput achieved in the current time step, $\hat{E}_t$ is the normalized energy cost of spectrum access, and $\mathbb{1}[k_t = k_{\min,t}]$ is an indicator function providing a reward bonus if the currently served class is the least served so far, encouraging service diversity. The weights $w_1$, $w_2$, $w_3$, and $w_4$ are parameters that balance the contribution of throughput, fairness, energy cost, and class equity, respectively, and $F_t$ is the instantaneous fairness index, computed using Jain’s index [27]:

$F_t = \dfrac{\left(\sum_{k=1}^{K} x_k\right)^2}{K \sum_{k=1}^{K} x_k^2}$.

Here, $F_t$ captures proportional fairness across device classes, while the indicator term provides an additional incentive to serve the least-attended class, ensuring diversity in scheduling decisions. The quantity $x_k$ is the cumulative payload volume successfully delivered to class $k$ across the episode, normalized by the number of time slots. Access counts and per-slot rewards are accumulated via environment feedback and aggregated at episode end. This reward formulation promotes efficient resource use while ensuring fairness and prioritizing underserved devices. The agent is trained to maximize the expected discounted cumulative reward [28]:

$J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$,

where $\pi$ is the policy mapping states to actions and $\gamma \in [0, 1)$ is the discount factor.
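A minimal Python sketch of this reward computation, using Jain’s index over the per-class delivered payload, is given below. The weight values and the handling of the all-zero case are illustrative assumptions rather than the exact settings used in our experiments.

import numpy as np

def jains_index(x):
    # Jain's fairness index over per-class delivered payload x_k
    x = np.asarray(x, dtype=float)
    if np.all(x == 0):
        return 1.0  # assumption: treat the all-zero case as perfectly fair
    return (x.sum() ** 2) / (len(x) * np.sum(x ** 2))

def step_reward(throughput_norm, energy_norm, payload_per_class, served_class,
                w1=1.0, w2=1.0, w3=1.0, w4=0.5):
    # w1..w4 are illustrative weights for throughput, fairness,
    # energy cost, and class equity (the values are assumptions).
    fairness = jains_index(payload_per_class)
    least_served_bonus = 1.0 if served_class == int(np.argmin(payload_per_class)) else 0.0
    return w1 * throughput_norm + w2 * fairness - w3 * energy_norm + w4 * least_served_bonus

# Example: the medium-priority class (index 1) is served in this slot
r = step_reward(throughput_norm=0.8, energy_norm=0.2,
                payload_per_class=[3.0, 1.0, 2.0], served_class=1)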
4. Proposed MARL-PASM Framework
In this section, we detail the design of the proposed MARL-PASM framework for H-IoT environments. The framework integrates six RL schemes, namely tabular Q-Learning, Double Q-Learning, Actor–Critic, DQN, Dueling DQN, and PPO. These agents learn policies to dynamically allocate spectrum across heterogeneous medical devices with varying priorities. The overall structure of the framework is illustrated in
Figure 3, while the generic training procedure that underpins all schemes is outlined in Algorithm 1.
Algorithm 1 Generic MARL–PASM training loop
1:  Initialize environment with N devices and traffic classes
2:  Initialize learner parameters for the chosen scheme
3:  for episode e = 1 to E do
4:      Reset environment and obtain initial state s
5:      for time step t = 1 to T do
6:          Select action a using exploration policy
7:          Apply a in the environment, observe reward r and next state s'
8:          Store (s, a, r, s') in buffer or trajectory memory
9:          Learner update:
10:             if Scheme is Q-Learning then update with TD rule
11:             else if Scheme is Double Q then update with decoupled select–evaluate
12:             else if Scheme is DQN then sample mini-batch from buffer and take a gradient step on TD loss
13:             else if Scheme is Dueling DQN then update value and advantage heads via TD loss
14:             else if Scheme is Actor–Critic then update policy with advantage estimate and update critic by value loss
15:             else if Scheme is PPO then compute advantages and optimize the clipped surrogate objective with value loss and entropy bonus
16:         s ← s'
17:         if episode terminates then
18:             break
19:         end if
20:     end for
21:     Log episode metrics: throughput in Mbps, delay in ms, energy efficiency in Gbits/J, fairness, blocking probability, utilization, training time
22: end for
23: return trained policy and recorded metrics
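To make the control flow of Algorithm 1 concrete, the following simplified Python sketch expresses the generic training loop against a hypothetical environment/learner interface; the method names env.reset, env.step, learner.act, learner.update, and env.episode_metrics are assumptions for illustration, not the simulator's actual API.

def train(env, learner, num_episodes, max_steps):
    """Generic MARL-PASM training loop (simplified sketch)."""
    metrics_log = []
    for episode in range(num_episodes):
        state = env.reset()
        for t in range(max_steps):
            action = learner.act(state)                               # exploration policy
            next_state, reward, done, info = env.step(action)
            learner.update(state, action, reward, next_state, done)   # scheme-specific rule
            state = next_state
            if done:
                break
        metrics_log.append(env.episode_metrics())                     # throughput, delay, fairness, ...
    return learner, metrics_log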
In the following subsections, each of the six reinforcement learning schemes integrated within MARL-PASM is described in detail. These include tabular Q-Learning, Double Q-Learning, DQN, Dueling DQN, Actor–Critic, and PPO, with emphasis on their learning mechanisms and update rules.
4.1. Tabular Q-Learning and Double Q-Learning
Tabular Q-Learning maintains a value table $Q(s,a)$ representing the expected long-term reward of taking action $a$ in state $s$ and following the current policy thereafter [29]. It updates values using the rule

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$,

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. This approach is simple and interpretable, but scales poorly with large state–action spaces.
To mitigate overestimation bias, Double Q-Learning [30] introduces two independent estimators $Q_A$ and $Q_B$. On each update, one estimator selects the greedy action while the other evaluates it, for example,

$Q_A(s,a) \leftarrow Q_A(s,a) + \alpha \left[ r + \gamma\, Q_B\!\left(s', \arg\max_{a'} Q_A(s',a')\right) - Q_A(s,a) \right]$,

with the roles of $Q_A$ and $Q_B$ swapped on alternate updates. Both methods use the discrete state representation defined in
Section 3. These algorithms serve as baseline models for comparison with neural approaches.
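The two tabular update rules can be sketched in Python as follows; the hyperparameter values and the flattened 75-state, 5-action table size (taken from the illustrative encoding shown earlier) are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Standard TD(0) update toward the greedy bootstrap target
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Decoupled select-evaluate: one table picks the greedy action,
    # the other evaluates it; the roles are swapped at random.
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:
        a_star = int(np.argmax(QB[s_next]))
        QB[s, a] += alpha * (r + gamma * QA[s_next, a_star] - QB[s, a])

# Example tables for the 75-state, 5-action space sketched earlier
Q = np.zeros((75, 5)); QA = np.zeros((75, 5)); QB = np.zeros((75, 5))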
4.2. Actor–Critic Q-Learning
Actor–Critic methods [29] decompose policy learning into two components: the actor $\pi(a \mid s)$ selects actions based on the current policy, and the critic $V(s)$ estimates the value function to guide policy updates. We use a tabular Actor–Critic variant where the critic updates the state-value estimate using

$V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]$.

The actor updates the policy via preference values $H(s,a)$, typically using a softmax policy:

$\pi(a \mid s) = \dfrac{\exp\!\big(H(s,a)\big)}{\sum_{b} \exp\!\big(H(s,b)\big)}$.

The preference is updated as

$H(s,a) \leftarrow H(s,a) + \beta\, \delta$,

where $\delta = r + \gamma V(s') - V(s)$ is the temporal-difference error.
Actor–Critic methods tend to converge faster in dynamic environments by decoupling policy and value updates, which is beneficial for H-IoT systems with mixed device demands.
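A compact Python sketch of this tabular Actor–Critic variant is given below, with illustrative step sizes; it follows the preference-based update described above rather than any particular library implementation.

import numpy as np

def softmax_policy(H, s):
    # Action probabilities from preference values H(s, .)
    prefs = H[s] - np.max(H[s])          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_critic_update(V, H, s, a, r, s_next, alpha=0.1, beta=0.05, gamma=0.95):
    # Critic: TD(0) update of the state value
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    # Actor: move the preference of the taken action along the TD error
    H[s, a] += beta * delta
    return delta

# Example tables for the illustrative 75-state, 5-action space
V = np.zeros(75); H = np.zeros((75, 5))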
4.3. Deep Q-Network (DQN)
To address the scalability limitations of tabular methods, we adopt DQN, which approximates $Q(s,a)$ using a neural network $Q(s,a;\theta)$ [28]. The network is trained to minimize the temporal-difference loss:

$L(\theta) = \mathbb{E}_{(s,a,r,s')}\!\left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right)^{2} \right]$,

where $\theta^{-}$ denotes the target network parameters updated periodically for training stability.
DQN employs experience replay, storing transitions in a buffer and sampling mini-batches for training. Input states are encoded using one-hot encoding for device class, current channel, and last action. This enables the agent to generalize across diverse traffic scenarios.
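The following condensed PyTorch sketch illustrates the TD-loss computation with a target network. The network width, and the assumption that a replay mini-batch is supplied as tensors of one-hot states, integer actions, rewards, next states, and done flags, are illustrative choices rather than our exact configuration.

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, x):
        return self.net(x)

def dqn_td_loss(q_net, target_net, batch, gamma=0.95):
    # batch: (states s, actions a, rewards r, next states s2, done flags) as tensors
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the periodically updated target network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)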
4.4. Dueling Deep Q-Network (Dueling DQN)
Dueling DQN [28] enhances learning by decomposing $Q(s,a)$ into two separate estimators:

$Q(s,a) = V(s) + \left( A(s,a) - \dfrac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \right)$,

where $V(s)$ represents the state value and $A(s,a)$ the advantage of action $a$ in state $s$. This architecture helps the agent identify important states independently of the action, improving learning efficiency and policy robustness.
The dueling architecture employs two neural network branches sharing initial layers, with one estimating $V(s)$ and the other $A(s,a)$. The aggregated output provides the final $Q(s,a)$ values for action selection.
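A minimal PyTorch sketch of the dueling head is shown below, using the common mean-subtracted aggregation; the layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Shared trunk with separate value and advantage heads (illustrative sizes)."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_head = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, x):
        h = self.trunk(x)
        v = self.value_head(h)                   # shape [B, 1]
        adv = self.advantage_head(h)             # shape [B, |A|]
        # Mean-subtracted aggregation keeps V and A identifiable
        return v + adv - adv.mean(dim=1, keepdim=True)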
4.5. Proximal Policy Optimization (PPO)
PPO is a policy-gradient reinforcement learning algorithm designed to achieve stable and reliable policy updates, particularly in complex or high-dimensional environments [
9]. PPO optimizes a clipped surrogate objective that constrains the policy update within a predefined trust region, preventing destructive large-step updates and improving convergence stability. The policy and value functions are both parameterized using deep neural networks, with the actor network producing a probability distribution over actions and the critic network estimating the state-value function.
The clipped objective function used in PPO is given by

$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right]$,

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate at time $t$, and $\epsilon$ is the clipping parameter. This formulation limits the policy update size while encouraging improvement only when it aligns with the advantage estimate.
In this work, PPO is implemented in an on-policy setting with mini-batch stochastic gradient descent updates. We adopt an adaptive Kullback–Leibler (KL) divergence penalty and entropy regularization to balance exploration and exploitation. Hyperparameters such as learning rate, clipping range, discount factor, and update frequency are tuned to ensure a fair comparison with the other MARL schemes. By incorporating PPO into the MARL-PASM framework, we aim to evaluate its potential for improving adaptability, convergence speed, and policy robustness in heterogeneous H-IoT scenarios.
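For reference, a short PyTorch sketch of the clipped surrogate loss and its combination with the value loss and entropy bonus is given below; the clipping range and loss coefficients shown are illustrative defaults, not the tuned values used in our experiments.

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize, while PPO maximizes the surrogate
    return -torch.min(unclipped, clipped).mean()

def ppo_total_loss(policy_loss, value_loss, entropy, value_coef=0.5, entropy_coef=0.01):
    # Combined objective with value loss and entropy bonus (coefficients are illustrative)
    return policy_loss + value_coef * value_loss - entropy_coef * entropy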
5. Experimental Setup
The proposed MARL-PASM framework is evaluated in a discrete-time, event-driven simulator that emulates realistic H-IoT operating conditions with heterogeneous device classes, latency-sensitive traffic, and energy-constrained spectrum access. Each device belongs to one of three clinically inspired classes with distinct transmission power limits, sampling rates, payload sizes, and battery capacities. Class-dependent energy-per-bit costs and mobility-induced throughput penalties directly influence packet arrivals, achievable rates, and energy depletion, allowing MARL agents to learn priority-aware scheduling under realistic constraints. All agents share a spectrum divided into five orthogonal channels, with each episode modeling sequential access, contention resolution, and dynamic decision-making. Simulations were implemented in Python 3.13 and executed on a 64-bit workstation. Hyperparameters such as learning rate, discount factor, and exploration decay were tuned for each scheme within consistent ranges to ensure stability and fairness in comparison.
The state space provided to each RL agent includes not only the device class, channel index, and previous action, but also the instantaneous queue occupancy and recent SINR statistics, which capture short-term backlog conditions and interference variability. This augmentation provides the agents with a more context-aware view of the environment without substantially increasing the state dimensionality.
Table 2 summarizes the physical-layer and network-level parameters. Where applicable, values are adapted from the IEEE TMLCN [
31] study to ensure realism in the assumed channel and transmission models.
The Baseline Scenario uses a small, fixed device population to study reward dynamics and convergence behavior in a controlled setting. The Scalability Sweep increases the network size up to 50 devices, randomly assigned to classes with the stated proportions, to emulate dense hospital or smart clinic deployments with significant spectrum contention. Device heterogeneity in energy budgets, sampling rates, transmission power constraints, and payload sizes follows the class profiles described in Table 2, ensuring that class-specific operational lifetimes and performance trade-offs are faithfully reflected. The 30–40–30 distribution was selected to mirror realistic H-IoT traffic patterns: medium-priority transmissions dominate (40%) owing to continuous monitoring and diagnostic data, while high-priority emergency alerts (30%) and low-priority background updates (30%) occur less frequently. This balance provides a representative workload for evaluating spectrum management policies under heterogeneous demand. All results are averaged over five independent runs with different random seeds to ensure statistical reliability.
Traffic models incorporate both periodic and bursty packet arrivals, with emergency devices capable of generating irregular high-priority bursts, glucose monitors producing low-rate continuous readings, and fitness trackers generating moderate-rate periodic data. These patterns, coupled with class-specific transmit powers, payload sizes, and energy budgets, emulate realistic medical traffic characteristics observed in hospital and home-care environments. Channel conditions include additive white Gaussian noise, log-normal shadowing, and distance-dependent path loss; no inter-channel interference is assumed for orthogonal allocation. Performance is evaluated using average throughput, delay, energy efficiency, Jain’s fairness index, blocking probability, interruption probability, convergence time, and training time, enabling a comprehensive comparison of algorithmic trade-offs under realistic H-IoT conditions. For PPO, we adopt a clipped surrogate objective with a tuned clipping range, learning rate, and discount factor, an update frequency of 4 epochs per batch, and an entropy regularization coefficient, ensuring a balance between convergence speed and policy stability for a fair comparison with the other RL schemes.
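As a rough illustration of this traffic model, the sketch below draws per-slot packet arrivals for the three device types; the class labels, rates, and burst sizes are placeholder assumptions and do not reproduce the parameters in Table 2.

import numpy as np

rng = np.random.default_rng(42)

def arrivals_this_slot(device_class, t):
    """Number of packets generated by one device in slot t (illustrative rates)."""
    if device_class == "emergency":
        # Rare, bursty high-priority alerts: occasional Poisson bursts
        return rng.poisson(3) if rng.random() < 0.02 else 0
    if device_class == "monitor":
        # Low-rate continuous readings (e.g., glucose): one packet every 10 slots
        return 1 if t % 10 == 0 else 0
    # Moderate-rate periodic data (e.g., fitness tracker): one packet every 2 slots
    return 1 if t % 2 == 0 else 0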
6. Results and Discussion
This section presents a comprehensive evaluation of MARL-PASM against six reinforcement learning schemes, focusing on baseline behavior, scalability under varying network sizes, distributional performance, and convergence/complexity trade-offs. The analysis connects observed trends directly to URLLC requirements, highlighting how MARL-PASM addresses throughput, delay, fairness, energy efficiency, and blocking probability in H-IoT networks.
6.1. Learning Dynamics and Baseline Performance
This section analyzes the learning behavior of the six reinforcement learning algorithms using three fundamental measures: the moving average of reward per episode, the convergence of maximum Q-values, and detailed PPO training diagnostics.
Figure 4 presents the moving average of rewards across 1000 training episodes. PPO consistently demonstrates superior reward learning, reaching values close to 95–100 and converging quickly within the first 200 episodes. Actor–Critic also performs strongly, stabilizing around 85–90. Dueling DQN achieves stable returns of approximately 80–85, while Q-Learning, Double Q-Learning, and DQN remain in the range of 70–80, reflecting their relatively slower convergence under complex H-IoT traffic conditions.
The Q-value convergence trends are depicted in
Figure 5. DQN achieves the highest stability, plateauing around 25–27, followed by Actor–Critic, which stabilizes near 22–24. Q-Learning and Double Q-Learning converge to more modest values of 15–16 and 5–6, respectively, showing their conservative policy updates. Interestingly, Dueling DQN, despite performing well in terms of rewards, shows consistently low Q-value estimates (near 0–1), which suggests that its advantage decomposition emphasizes relative action evaluation rather than absolute Q-value scaling. It is important to note that Q-value convergence is not directly applicable to PPO, as it does not rely on explicit Q-value estimation. Instead, PPO stability is assessed via its diagnostics (policy loss, value loss, entropy, and KL divergence), shown in
Figure 6. This explains the addition of a dedicated PPO diagnostic figure alongside reward and Q-value plots.
To further analyze PPO,
Figure 6 provides detailed diagnostics. The policy loss remains tightly bounded around zero, confirming stable clipped surrogate updates. The value loss fluctuates between 2000 and 3000, reflecting the critic’s effort to adapt to diverse traffic and channel conditions. Policy entropy remains between 0.5 and 1.5, indicating sustained exploration during training. Finally, the KL divergence between old and new policies stays mostly below 0.015, ensuring that PPO maintains stability without catastrophic policy shifts. The combination of high reward and stable convergence trends highlights the strong learning potential of PPO and Actor–Critic in spectrum management for H-IoT. DQN also shows promising convergence but with slightly lower returns. Meanwhile, Q-Learning and Double Q-Learning remain suitable for lightweight deployments where simplicity and stability are prioritized, though they underperform in complex dynamic environments. Dueling DQN demonstrated good reward performance but limited Q-value growth, suggesting that its structural advantage is more beneficial for relative action selection than absolute value scaling.
6.2. Scalability and Robustness
To evaluate the framework’s robustness in dense H-IoT scenarios, we simulate scaling behavior from 5 to 50 devices.
Figure 7,
Figure 8,
Figure 9,
Figure 10 and
Figure 11 illustrate the trends in throughput, delay, fairness, blocking probability, and training time across all six reinforcement learning methods.
Figure 7 highlights that Dueling DQN sustains the highest throughput, stabilizing around 36–37 units regardless of scale, followed closely by DQN at 34–35 units. In contrast, Q-Learning, Double Q-Learning, and Actor–Critic all experience sharp throughput declines below five units as device count increases. PPO performs well under light load (≈28 units at five devices) but quickly degrades under congestion, converging near one to two units at high density. Delay trends, shown in
Figure 8, reinforce these findings. PPO demonstrates the lowest and most consistent delay (~2 ms), while Dueling DQN achieves low and stable latency (14–15 ms). Q-Learning, Double Q-Learning, and DQN stabilize in the range of ~16–19 ms. Actor–Critic, however, suffers from the worst delay, saturating near 99 ms, reflecting instability in its value estimation under scale.
Fairness outcomes in
Figure 9 show that Dueling DQN consistently achieves the highest fairness index (~0.65), followed by DQN (~0.62). PPO lags behind significantly, rarely exceeding 0.2, indicating a trade-off between its low blocking probability and equitable resource allocation. Q-Learning and Double Q-Learning exhibit fairness dips at intermediate scales but recover to ~0.25–0.3. Actor–Critic, though initially poor, improves beyond 0.35 at 50 devices, suggesting late adaptivity. Blocking probability results (
Figure 10) demonstrate that PPO achieves the lowest blocking (~0.01), outperforming all other methods. Dueling DQN follows at 0.07–0.08, and DQN at 0.09–0.10, while Q-Learning, Double Q-Learning, and Actor–Critic remain significantly higher around 0.17–0.18.
Training time comparisons (
Figure 11) add another layer of insight. Tabular Q-Learning, Double Q-Learning, and Actor–Critic remain lightweight, with runtimes near 1 s. PPO shows moderate computational overhead (~30–35 s), while DQN requires ~55 s, and Dueling DQN incurs the highest cost (~80–90 s). This highlights the trade-off: PPO is efficient in latency and blocking but less fair, while Dueling DQN demands more computation yet provides the most balanced overall scalability. Taken together, these results suggest that Dueling DQN and PPO offer complementary strengths. Dueling DQN achieves robust fairness, high throughput, and low delay, while PPO minimizes blocking and delay but struggles with fairness under scale. For fairness-critical and high-utilization H-IoT deployments, Dueling DQN is preferable, whereas PPO may be suitable for ultra-reliable, latency-sensitive applications where fairness is less critical.
6.3. Distributional Performance Insights
This section provides a detailed distributional analysis of key performance indicators using cumulative distribution functions (CDFs) and corresponding mean bar plots. The performance of each MARL variant is evaluated with respect to throughput, delay, energy efficiency, fairness, channel utilization, and interruption probability.
6.3.1. Throughput Distribution
Figure 12 and
Figure 13 present the throughput distribution (CDF) and mean throughput across all schemes. The results reveal a more balanced landscape compared to earlier findings. From the CDF in
Figure 12, all reinforcement learning schemes exhibit relatively close distributions, with slight shifts in tail behavior. PPO demonstrates a slightly delayed rise in its CDF curve, indicating that while it achieves higher peak throughput in certain instances, its distribution is less consistent at lower percentiles compared to DQN and Dueling DQN.
The mean throughput comparison in
Figure 13 highlights PPO as the best performer in mean throughput, with Dueling DQN and Actor–Critic following closely behind, confirming their robustness under varying traffic loads. In contrast, Q-Learning, Double Q-Learning, and DQN yield lower mean throughput values, reflecting the limitations of tabular and standard value-based approaches in dynamic environments (the averaged values are consolidated in Table 3). These findings demonstrate that PPO and Dueling DQN sustain higher throughput under realistic load conditions, directly supporting life-critical data transmission in H-IoT.
6.3.2. Delay Distribution
Figure 14 and
Figure 15 illustrate the comparative delay distributions across all six learning schemes. The results clearly indicate that PPO achieves the lowest delays, with a mean delay of approximately 7 ms and its CDF rising steeply, demonstrating that the majority of transmissions complete within very low latency bounds. This highlights PPO’s efficiency in maintaining stability under varying network conditions.
Q-Learning and Double Q-Learning exhibit moderate performance, averaging around 10–12 ms, while Dueling DQN maintains a slightly higher mean of 12 ms. DQN lags further behind with an average delay close to 14 ms, showing less robustness in latency-sensitive scenarios. Actor–Critic, although computationally lightweight, continues to underperform with delays frequently exceeding 80 ms in tail cases, reflecting instability in value estimation and gradient updates.
Importantly, tail performance analysis shows that PPO reduces the 95th-percentile delay by approximately 11.5% compared to the next best scheme. This confirms that PPO not only minimizes average latency but also effectively suppresses extreme delay outliers, which is crucial for URLLC scenarios where millisecond-level guarantees are decisive.
6.3.3. Energy Efficiency Analysis
Figure 16 and
Figure 17 illustrate the distribution and mean values of energy efficiency across all six schemes. PPO achieves the highest mean energy efficiency, outperforming all other algorithms and highlighting its ability to optimize both spectral utilization and power consumption simultaneously. Dueling DQN and DQN follow closely, demonstrating the strength of deep neural approximators in sustaining efficient policies under high-load H-IoT conditions. Tabular Q-Learning and Double Q-Learning exhibit comparatively lower efficiency, reflecting their limited adaptability in dynamic spectrum environments, while Actor–Critic records the lowest performance among the function approximators, consistent with its weaker convergence behavior observed in earlier subsections (averaged values are reported in Table 3).
6.3.4. Fairness Evaluation
Figure 18 and
Figure 19 demonstrate that all six schemes achieve consistently high fairness, with mean values clustered in a narrow range. Unlike earlier observations where Dueling DQN appeared dominant, the distributional results reveal only marginal differences among the algorithms. PPO achieves the highest fairness, closely followed by Dueling DQN and Q-Learning; Double Q-Learning and DQN exhibit nearly identical fairness levels, while Actor–Critic also maintains comparable equity in resource allocation. The convergence of fairness indices across schemes indicates that MARL-based policies can ensure socially fair spectrum access even under dense H-IoT conditions, and that throughput and delay optimization do not come at the expense of fairness, reinforcing the robustness of the proposed PASM framework.
6.3.5. Channel Utilization and Interruption Risk
Figure 20 and
Figure 21 present the trade-off between channel utilization and interruption probability across all six schemes. The results show that PPO achieves the highest mean channel utilization, approaching 0.19 (normalized), with consistently wider distribution bounds. Dueling DQN and Actor–Critic follow closely, maintaining utilization above 0.18, whereas Q-Learning, Double Q-Learning, and DQN show slightly lower averages (0.17–0.18). This indicates that deep policy-based methods allocate resources more aggressively and effectively in congested conditions.
However, higher utilization must be balanced against interruption risk. The CDF of interruption probability highlights that Actor–Critic exhibits the steepest distribution, reaching nearly 90% probability of interruption in extreme cases, reflecting unstable policy convergence. Q-Learning and Double Q-Learning perform moderately, but their interruption probability spreads across wider ranges, indicating less predictability. In contrast, PPO demonstrated a more gradual CDF slope, sustaining lower interruption risks relative to its higher utilization. Dueling DQN also performs robustly, offering a balanced compromise between throughput-driven utilization and stability.
6.4. Convergence vs. Computational Cost
Figure 22 shows that PPO converges in the fewest episodes (around 75) and with the lowest training time (below 80 s), highlighting its efficiency in both learning and computation. In contrast, Actor–Critic converges in nearly 440 episodes and incurs the largest training cost of approximately 250 s, revealing instability in its policy-gradient updates. Among the deep Q-learning variants, Dueling DQN and DQN converge in approximately 340–390 episodes, slightly slower than the tabular methods but more stable once converged. Tabular Q-Learning and Double Q-Learning require around 310–325 episodes, but with negligible training overhead (below 25 s), which makes them lightweight but less robust at scale.
Beyond training convergence, we also assessed inference complexity across schemes. The tabular methods required negligible memory (under 1 kB) and delivered near-instant decision latency, yet they struggled to scale in accuracy and stability. DQN and Dueling DQN used model sizes in the tens of kilobytes and achieved inference times of around 0.02 to 0.03 ms per decision, yielding thousands of decisions per second on commodity hardware. PPO used the largest model but remained highly efficient, with inference latency close to 0.01 ms and the highest decision throughput. Actor–Critic fell between these extremes, with a footprint of roughly 60k parameters and an inference time of 0.04 ms per decision, a larger memory footprint than the tabular methods and slightly higher latency than PPO.
Figure 23 further evaluates the blocking probability. PPO delivered the lowest blocking probability of around 0.02 with limited variance, demonstrating its ability to ensure reliable service continuity. Actor–Critic also achieves a low average blocking near 0.03, but its higher variance suggests inconsistency across traffic scenarios. Dueling DQN and Q-Learning maintain moderate blocking rates around 0.045, while Double Q-Learning and DQN record the highest blocking rates between 0.055 and 0.06, which may hinder performance under heavier loads.
6.5. Holistic Performance Comparison
Table 3 consolidates the performance of all six schemes across throughput, delay, energy efficiency, fairness, blocking probability, convergence behavior, and training time. The results reveal that PPO consistently outperforms its counterparts by delivering the highest throughput of 680 Mbps, the lowest average delay of 9.2 ms, and the strongest energy efficiency of 15 gigabits per joule. It also achieves the lowest blocking probability of 0.02 and the fastest convergence at 75 episodes, balancing stability with computational efficiency. Among the deep Q-learning methods, Dueling DQN emerges as the most competitive alternative. It offers balanced throughput of 640 Mbps, low delay of 10.4 ms, high energy efficiency of 13.4 gigabits per joule, and fairness of 0.823. Although its training time of 190 s is higher than PPO, its robustness makes it a suitable candidate for applications where policy stability is critical. Standard DQN showed slightly weaker delay and blocking performance, though it maintains strong throughput and energy metrics.
The tabular methods, Q-Learning and Double Q, demonstrate lower computational overhead but lag in throughput, energy efficiency, and blocking probability, which highlights their limited scalability in complex H-IoT environments. Actor–Critic performs moderately, with reasonable fairness and blocking values, but its slower convergence of 440 episodes and higher training time of 250 s make it less practical for real-time deployments. These findings underscore PPO as the most promising scheme for large-scale H-IoT spectrum management, with Dueling DQN as a strong deep reinforcement learning alternative. The comparative analysis validates the trade-off between algorithmic complexity and operational performance, providing insights into the selection of appropriate strategies for different deployment scenarios. These results show that evaluating diverse RL schemes in a unified PASM framework reveals complementary strengths that were previously overlooked in single-scheme studies. This integrated perspective constitutes the key novelty of our work and provides actionable insights for selecting spectrum policies in practical H-IoT deployments.
Table 3 presents the averaged numerical results for all evaluated schemes, complementing the graphical analysis and enabling direct quantitative comparison.
6.6. Practical Considerations for Deploying MARL in H-IoT Networks
Although the proposed MARL-PASM framework performs satisfactorily in the simulated scenarios, deploying it in a real-world H-IoT network raises several practical challenges that require careful scrutiny. In practice, estimating global or shaped reward signals within a live network is not straightforward. Centralized orchestration through an edge server can gather observations and compute feedback from statistics such as latency violations, interference levels, or successful data delivery; however, centralized reward computation introduces additional delay and scalability issues. Furthermore, coordination between MARL agents, which is essential for learning stability and convergence, creates non-trivial communication overhead, particularly when bandwidth and latency are limited. To counteract this, real-world systems can employ decentralized training, federated MARL, or low-communication architectures, trading some coordination for efficiency. Partial observability is another significant problem: H-IoT nodes usually operate with incomplete environmental awareness, which can slow convergence and weaken policy robustness. Techniques such as belief-state modeling or recurrent policy learning can provide robustness under such limitations. Training complexity is also a concern; in practice, agents would either be trained offline in digital-twin platforms or fine-tuned in a lightweight manner to remain computationally manageable. Finally, the network stack imposes design constraints. As noted in recent research such as [31], MAC-layer learning platforms must incorporate realistic access delays, collision dynamics, and signaling limitations to remain deployable. These factors guide the future research roadmap, in which scalability, feasibility, and protocol awareness must be co-optimized to make MARL-based spectrum access practically viable for clinical and mission-critical H-IoT applications.
6.7. Future Work
In future extensions of this work, we will move beyond the fixed three-tier device classification used here and adopt context-aware dynamic prioritization. This will allow devices to adjust their priority in real time based on physiological readings, patient location, or temporal urgency, thereby more accurately reflecting real-world medical requirements. We will design lightweight classifiers to enable this dynamic priority assignment while ensuring low computational overhead, which is essential for resource-constrained H-IoT deployments. Another direction we will pursue is the development of hybrid reinforcement learning architectures that merge complementary schemes. For example, we will combine the stable policy updates of PPO with the rapid value-based learning of DQN variants, or integrate Actor–Critic frameworks with double estimators to mitigate bias. These hybrid approaches will be designed to improve adaptability, accelerate convergence, and enhance policy robustness in complex and rapidly changing H-IoT environments. We will also extend the state space to include long-term temporal patterns such as time-of-day usage cycles and patient-specific activity trends, enabling agents to anticipate and adapt to predictable fluctuations in healthcare traffic. Future work will also extend PASM with classical scheduling baselines such as priority schedulers, TDMA, and proportional fair allocation, enabling direct head-to-head benchmarking with RL-based methods.
In addition, we will expand MARL-PASM to handle extreme-event traffic patterns, such as scenarios where many devices simultaneously transmit emergency data or where traffic dynamics shift abruptly. This will allow us to stress-test the framework and assess its scalability under highly dynamic and challenging conditions. We will also incorporate explicit fault tolerance evaluation, covering sudden device disconnections, overlapping transmissions, high packet loss rates, and rapid energy depletion. These experiments will provide a deeper understanding of the system’s resilience and reliability under adverse operating environments. Furthermore, we will integrate privacy-preserving mechanisms, with a focus on Federated Multi-Agent Reinforcement Learning, where agents learn locally and share only model updates. By coupling this with differential privacy, we will ensure sensitive patient data remains protected while maintaining effective collaborative learning, thereby meeting strict medical data protection requirements. Finally, we will re-implement and benchmark MARL-PASM directly against advanced distributed spectrum allocation and RL-based routing models to enable consistent, head-to-head comparisons. This will ensure that our framework is rigorously evaluated against state-of-the-art solutions and highlights its contributions to the broader landscape of spectrum management in healthcare IoT systems.
7. Conclusions
This study presented the MARL-PASM framework for dynamic spectrum allocation in H-IoT environments, addressing the challenges of heterogeneous traffic classes and stringent QoS demands through centralized, intelligent decision-making. By benchmarking six RL strategies, including PPO, we demonstrated that advanced DRL models deliver clear performance advantages. PPO consistently achieved the highest throughput, lowest latency, strongest energy efficiency, and lowest blocking probability, while Dueling DQN emerged as the most balanced deep Q-learning alternative, sustaining robust performance under dense network conditions. Actor–Critic offered fast convergence under lighter traffic loads, and the tabular methods, though computationally efficient, lacked scalability in complex deployments. Scalability and robustness evaluations confirmed the ability of DRL methods to maintain stability and fairness as device density increases, highlighting their potential for real-world H-IoT applications.
Despite these promising outcomes, several limitations remain. The current framework employs fixed device classes and static priority distributions, does not fully account for extreme emergency surges, and has not been rigorously stress-tested for fault tolerance under conditions such as device disconnections, interference spikes, or energy depletion. Privacy-preserving mechanisms such as federated MARL are also yet to be explored, and validation using real-world, trace-driven datasets is still required. Addressing these gaps will involve incorporating dynamic prioritization, hybrid RL architectures, explicit resilience testing, and privacy-preserving learning techniques. Future work will also benchmark MARL-PASM directly against state-of-the-art distributed spectrum allocation and RL-based routing models to ensure a rigorous comparative evaluation. By advancing along these directions, MARL-PASM can evolve into a robust, scalable, and secure spectrum management framework capable of supporting the safety-critical requirements of next-generation H-IoT communication systems.