Regional Strategy Composition: A Hierarchical-Action Reinforcement Learning Framework for Dynamic Smart-Meter Association over 5G NR mMTC Networks

Al-Ali, Muhammed; Inga, Esteban; Inga, Juan; Yaacoub, Elias

doi:10.3390/fi18070337


¹ Department of Computer Science and Engineering, Qatar University, Doha P.O. Box 2713, Qatar

² Smart Grids Research Group (GIREI), Universidad Politécnica Salesiana, Cuenca 010102, Ecuador

³ Telecommunications and Telematic Research Group (GITEL), Universidad Politécnica Salesiana, Cuenca 010102, Ecuador

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(7), 337; https://doi.org/10.3390/fi18070337 (registering DOI)

Submission received: 26 May 2026 / Revised: 16 June 2026 / Accepted: 22 June 2026 / Published: 25 June 2026

(This article belongs to the Special Issue Artificial Intelligence in Smart Grids)


Abstract

Advanced Metering Infrastructure (AMI) over 5G New Radio (NR) massive machine-type communication (mMTC) networks require efficient and adaptive communication mechanisms to support reliable data delivery for large numbers of smart meters under dynamic traffic and channel conditions. In this work, we propose a framework in which each smart meter chooses, at runtime, whether to transmit directly to the base station (BS) or via a nearby Data Aggregation Point (DAP). The optimal choice is dynamic and depends on DAP buffer occupancy, periodic congestion, channel quality, and packet deadline pressure. Formulating this as a per-meter binary decision yields an action space of size

2^{N}

for N meters, which is intractable for reinforcement learning (RL). We reformulate the problem as regional strategy composition: the RL agent selects one parameterized association strategy for each DAP region from a small library of interpretable rules, and a deterministic mapping expands the regional choice into per-meter modes. This reduces the policy action space from

2^{N}

to

K^{D}

, where D is the number of DAPs and K the number of strategies, while preserving meter-level control granularity. We evaluate Proximal Policy Optimization (PPO) and Deep Q-Network (DQN) controllers against eight meter-level baselines on a 5G NR-calibrated simulator with 1500 meters, six DAPs, deadline-bounded delivery, stale channel-state information, and phase-offset congestion cycles. Across three traffic regimes and five random seeds, PPO improves packet delivery ratio (PDR) over the strongest heuristic by +0.63, +2.41, and +2.66 percentage points under baseline, high-load, and bursty-cycle conditions, respectively; all gains are statistically significant (paired t-test,

p < 0.001

; Cohen’s d up to 5.12), and the advantage grows with traffic stress. The results show that learned regional composition of classical heuristics outperforms any single fixed heuristic precisely when no individual rule is globally optimal.

Keywords:

advanced metering infrastructure; 5G NR mMTC; data aggregation point; dynamic association; reinforcement learning; hierarchical action space; hyper-heuristic; proximal policy optimization

1. Introduction

Advanced Metering Infrastructure (AMI) connects large populations of smart meters to utility back-ends for near-real-time consumption reporting, and it has become a foundational service of the smart grid [1,2]. When this connectivity is provided by a 5G New Radio (NR) massive machine-type communication (mMTC) slice, the access network must serve thousands of low-rate devices per cell under a limited pool of Resource Blocks (RBs). The mismatch between the number of meters and the available radio resources makes the resource-management policy a binding constraint on delivery performance, a difficulty that is well documented in the broader study of scheduling for mMTC and ultra-reliable low-latency traffic [3,4].

A widely adopted way to relieve this pressure is to introduce Data Aggregation Points (DAPs), intermediate nodes that collect packets from nearby meters and forward them to the base station (BS) in larger backhaul bundles [5,6,7]. The placement of DAPs and the assignment of meters to them have been studied extensively as planning problems, typically through clustering or facility-location optimization that minimizes infrastructure and communication cost subject to coverage and capacity constraints [1,2,8,9]. More recent designs couple this placement with quality-of-service-aware aggregation and traffic handling [10,11]. A common property of these works is that the meter-to-DAP mapping is computed once, at deployment time, and is not revised during operation.

Two transmission paths are therefore available to each meter. In the direct path, the meter transmits its packet to the BS over the cellular uplink; in the aggregated path, the meter transmits to a nearby DAP that buffers the packet and later forwards it. The direct path avoids DAP buffering delay but contends for scarce BS resources, whereas the aggregated path offers higher effective capacity through aggregation but introduces a queuing delay that can cause packets to miss their deadlines when many meters use the same DAP at once. Because deadline compliance is central to metering traffic, deadline- and freshness-aware handling of mMTC flows has itself become an active research direction [12,13,14,15].

The best path for a given meter is consequently not fixed. It depends on the current DAP buffer occupancy, the temporal congestion pattern of the DAP, the channel quality on each path, the remaining deadline of the packet, and the global traffic load. Static assignment policies cannot react to these dynamics and leave performance unrealized when conditions change. This motivates dynamic association, in which the transmission mode of each meter is re-evaluated periodically during operation rather than fixed at planning time.

Reinforcement learning (RL) is an attractive tool for such dynamic control because it learns policies directly from interaction without requiring an explicit system model, and it has been applied successfully to spectrum access, power control, scheduling, and user association in time-varying wireless networks [16,17,18,19,20,21].

The evidence, however, is nuanced. RL reliably outperforms simple heuristics such as random or max-SINR association [16,18,22], yet against strong model-based optimization with full information, it tends to approach rather than exceed the optimum, reaching roughly 90–97% of exhaustive-search or Weighted Minimum Mean Square Error (WMMSE) baselines [17,23]. This pattern frames the central question of the present work: can a learned policy exceed the best-engineered heuristic for smart-meter association, and, if so, why?

A direct RL formulation that selects a binary mode for each of the N meters produces an action space of size

2^{N}

. For the populations typical of AMI, this space is far too large for stable policy gradient learning: the agent receives a single scalar reward for a joint decision over all meters and cannot attribute credit to individual choices, so the gradient signal is weak, and training does not converge to a useful policy. This action-space explosion is precisely the kind of difficulty that hierarchical RL is designed to address, by decomposing a decision into a high level that selects abstract actions and a low level that implements them, thereby reducing the number of decision points and the effective action-space size [24,25,26,27]. A complementary line of work, studied under hyper-heuristics and adaptive operator selection, models the control process as a Markov decision process whose actions are choices among low-level heuristics rather than raw decisions, and learns which heuristic to apply at each step [28,29,30,31,32,33,34].

Building on these two ideas, this paper resolves the action-space problem through a hierarchical reformulation that we call regional strategy composition. Rather than choosing a mode for every meter, the RL agent chooses one parameterized association strategy per DAP region from a small library of interpretable rules: direct-priority, DAP-priority, buffer-aware, cycle-aware, and deadline-aware. A deterministic mapping then expands each regional strategy into per-meter modes using local state. The agent’s action space shrinks from

2^{N}

to

K^{D}

, where D is the number of DAP regions and K the number of strategies, while every meter still receives an individualized decision. The agent learns which heuristic to apply in which region and at which time, an ability that no single fixed heuristic possesses by construction. In effect, the controller acts as a spatially distributed, online heuristic selector, transplanting the hyper-heuristic principle from combinatorial optimization into a networking control problem.

The contributions of this paper are as follows.

We formalize the dynamic smart-meter association as a Markov decision process with deadline-bounded delivery, stale channel-state information (CSI), and phase-offset DAP congestion cycles, and we show that the naive per-meter action space is intractable.
We propose a regional strategy-composition formulation that reduces the action space from $2^{N}$ to $K^{D}$ and preserves meter-level granularity through a deterministic strategy-to-mode mapping.
We implement Proximal Policy Optimization (PPO) and Deep Q-Network (DQN) controllers under this formulation, and compare them on a single 5G NR-calibrated simulator, against eight meter-level baselines, including the strongest buffer-aware heuristic and a perfect-information cycle oracle.
We report a statistically rigorous evaluation over five random seeds and three traffic regimes, with paired significance tests and effect sizes, and we show that the learned controller’s advantage over the best heuristic grows monotonically with traffic stress.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the system model. Section 4 formulates the association problem and the regional reformulation. Section 5 details the strategies and the learning algorithms. Section 6 describes the experimental setup. Section 7 presents results. Section 8 discusses implications and limitations, and Section 9 concludes the paper.

2. Related Work

2.1. DAP Placement and Static Meter Assignment

The placement of DAPs and the assignment of meters to them are usually posed as optimization problems that minimize infrastructure and communication cost, subject to coverage and delay constraints. Lang et al. [5] formulate a mixed-integer program that jointly minimizes installation, transmission, and delay cost, solved by an alternating clustering–assignment–routing heuristic. Sung et al. [6] discretize the service area into a grid and place relay DAPs to keep all meters within range while minimizing the DAP count. Inga et al. [8] integrate concentrator capacity, coverage, and routing into a combined sizing-and-routing model solved by graph-based heuristics.

Clustering is the dominant tool for deciding both DAP locations and meter assignment. Molokomme et al. [1] model meter positions as a spatial point process and apply K-means with a silhouette criterion to select the number of DAPs. Gallardo et al. [2] partition the grid into neighborhood-area sub-networks and define a coverage-density metric to evaluate placement. Abdullah and Ashraf [9] propose dual placement algorithms that minimize either the mean or the worst-case meter-to-DAP distance. Khan et al. [10,11] couple modified K-means clustering with QoS-aware traffic scheduling and cluster-head aggregation.

A common property of these works is that the meter-to-DAP mapping is computed once, at planning time, and is not revised during operation. Our work is complementary: we take the DAP placement as given and address the run-time decision of which path each meter should use as conditions evolve.

2.2. Reinforcement Learning for Wireless Resource Management

Deep RL has been applied to spectrum access, power control, scheduling, and user association in time-varying wireless networks [16,17,18]. For user association specifically, multi-agent Q-learning has been used for real-time load balancing and handover [19], deep Q-learning for cell-individual-offset tuning [35] and heterogeneous-network load balancing [36], tree-structured policy gradients for joint association and power control [20], and multi-agent PPO for decoupled uplink–downlink association in UAV-assisted networks [37]. Digital-twin-assisted training has been used to generalize the association policies across user counts [21].

The evidence on RL versus classical methods is nuanced and directly relevant to how we frame our results. RL reliably outperforms simple heuristics such as random access, max-SINR association, and greedy allocation [16,18,22]. Against strong model-based optimization with full information, however, RL typically approaches rather than exceeds the optimum: multi-agent RL reaches about

90 %

of a WMMSE-plus-exhaustive-search baseline [17], and meta-RL attains

97 %

of an exhaustive-search fairness upper bound [23]. This pattern—RL beating weak heuristics but only matching strong optimization—frames our central question: can a learned policy exceed the best engineered heuristic for smart-meter association, and if so, why?

2.3. Hierarchical RL and Heuristic Selection

Our reformulation draws on two established ideas. Hierarchical RL (HRL) decomposes a complex decision into layers, with a high level selecting abstract actions or subtasks and a low level implementing the details [24,25]. This temporal and state abstraction reduces the number of decision points and shrinks the effective action space, improving sample efficiency in domains such as multi-UAV control [26] and UAV-assisted mobile-edge computing [27].

The second idea is RL-based heuristic selection, studied under hyper-heuristics and adaptive operator selection. There, the search process is modeled as a Markov decision process whose actions are choices among low-level heuristics or operators, and whose reward measures solution improvement [28,29,30,31]. Q-learning selection hyper-heuristics learn which low-level heuristic to apply at each step [32,34], and hierarchical hyper-heuristics organize both state and action spaces into layers [33]. Our regional strategy composition applies this principle to a networking problem: the RL agent selects which classical association heuristic to apply in each DAP region, making it a spatially distributed online heuristic selector rather than a direct controller.

We position our method against two learned alternatives. The first is end-to-end association, in which a single agent learns the per-meter decision directly. Deep and multi-agent RL have been applied to joint association and resource allocation in this end-to-end manner [21,38], but at our scale the per-meter action space is

2^{N}

(Section 4), which is precisely the intractability our reformulation avoids; an end-to-end learner must therefore either restrict the action space or accept the high-variance credit assignment described in Section 4. The second alternative is a fully hierarchical RL controller in which the low-level is itself learned. Hierarchical RL that decomposes an intractable action space into a high-level goal and a low-level policy has been used for uplink OFDMA/MU-MIMO scheduling [39], relay selection and power control [40], and multi-operator RIS optimization [41]. Our design differs deliberately: the high-level action is learned, while the low level is a fixed library of interpretable rules rather than a second learned policy. This is a hierarchical-action formulation, or equivalently a spatially distributed online hyper-heuristic, not an options- or feudal-style HRL controller; we therefore avoid the “hierarchical RL” label for the method itself. The trade-off is explicit: a learned low level could in principle express policies outside the fixed library, but at the cost of interpretability and of reintroducing a large low-level action space. To our knowledge, no prior learned method targets the specific two-path AMI association problem studied here on an equal footing, so we benchmark against the full set of meter-level heuristics (Section 6) and analyze the action-space expressiveness of the regional formulation directly (Section 4 and Section 7).

2.4. Deadline-Aware mMTC Scheduling

At the scheduling layer, mMTC traffic is served by grant-based, grant-free, and slice-based schedulers tailored to heterogeneous QoS requirements [3]. Deadline and age-of-information awareness is achieved through per-slot priority scheduling [12], deadline-budgeted resource allocation, and active queue management. Grant-free access with dynamic and priority-based slot allocation addresses contention among heterogeneous devices [13,14], and learning-based fast uplink grant uses traffic prediction to pre-allocate resources [15]. Energy–latency trade-offs in NB-IoT and slice-level multi-objective scheduling have also been studied [4,42]. Our work places a token-weighted scheduler at the intra-cluster layer and focuses the learning effort on the association layer above it.

3. System Model

3.1. Network Topology

We consider a single cell of radius

R_{cell} = 1000

m with a BS at the origin and

D = 6

DAPs placed on a ring of radius 500 m at uniform angular spacing. The set of meters is

M = {1, \dots, N}

with

N = 1500

. A fraction

ρ = 0.7

of meters are clustered near DAPs (within 30–180 m of a randomly chosen DAP) and the remainder are scattered uniformly in the cell. Each meter i has a fixed position

p_{i} \in R^{2}

, a distance to the BS

d_{i}^{BS} = ∥ p_{i} ∥

, and a nearest DAP

δ (i) = arg min_{d \in {1, \dots, D}} ∥ p_{i} - q_{d} ∥,

(1)

where

q_{d}

is the position of DAP d, with corresponding distance

d_{i}^{DAP} = ∥ p_{i} - q_{δ (i)} ∥

.
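The ring layout and the nearest-DAP rule in (1) can be sketched in a few lines. This is a minimal illustration, not the authors' simulator: the function names are hypothetical, and placing DAP 1 at angle 0 is an assumption, since the angular origin of the ring is not stated here.

```python
import math

R_RING_M = 500.0   # DAP ring radius in metres (from the text)
NUM_DAPS = 6       # D (from the text)

def dap_positions(num_daps=NUM_DAPS, radius=R_RING_M):
    """DAP coordinates q_d on a ring at uniform angular spacing.
    ASSUMPTION: DAP 1 sits at angle 0; the source does not fix the origin."""
    return [
        (radius * math.cos(2 * math.pi * d / num_daps),
         radius * math.sin(2 * math.pi * d / num_daps))
        for d in range(num_daps)
    ]

def nearest_dap(p, daps):
    """Eq. (1): return (delta_i, d_i_DAP), the 1-based index of the
    nearest DAP and the distance to it."""
    dists = [math.dist(p, q) for q in daps]
    idx = min(range(len(daps)), key=dists.__getitem__)
    return idx + 1, dists[idx]

daps = dap_positions()
meter = (480.0, 60.0)                 # hypothetical meter position p_i
delta, d_dap = nearest_dap(meter, daps)
d_bs = math.hypot(*meter)             # d_i^BS = ||p_i||, BS at the origin
print(delta, round(d_dap, 1), round(d_bs, 1))
```

Because the meter positions are fixed, this lookup only needs to run once per episode; the result can be cached and reused at every epoch.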

3.2. Physical Layer

We adopt 5G NR Numerology 0: subcarrier spacing

Δ f = 15

kHz,

N_{sc} = 12

subcarriers per RB, and slot duration

T_{slot} = 1

ms [43]. The RB bandwidth is

B_{RB} = Δ f N_{sc} = 180

kHz and the carrier frequency is

f_{c} = 3.5

GHz. Path loss follows the 3GPP urban-macro line-of-sight model [43].

PL (d) = 32.4 + 21 {log}_{10} (d) + 20 {log}_{10} (f_{c} [GHz]) [dB],

(2)

with d in meters. The received signal-to-noise ratio for meter i on a path with transmit power

P_{tx}

and aggregation-gain bonus G is

γ_{i} = P_{tx} + G - PL (d_{i}) - χ_{i} - P_{N},

(3)

where all quantities are in dB or dBm,

χ_{i} \sim N (0, σ_{s}^{2})

is log-normal shadowing with

σ_{s} = 6

dB, and the noise power is

P_{N} = - 174 + 10 {log}_{10} (B_{RB}) + N_{F},

(4)

with noise figure

N_{F} = 7

dB. The DAP path receives an SNR bonus

G = G_{DAP} = 12

dB reflecting the shorter meter-to-DAP link and relay processing gain; the direct path has

G = 0

. The achievable rate on one RB follows a capped Shannon expression

r_{i} = B_{RB} min ({log}_{2} (1 + 10^{γ_{i} / 10}), η_{max}),

(5)

where

η_{max} = 5.55

bit/s/Hz is the 256-QAM modulation ceiling. A packet of

L = 200

bits is delivered in one slot on an assigned RB if

r_{i} T_{slot} \geq L

.
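Equations (2)–(5) and the one-slot delivery condition chain together into a short link-budget calculation. The sketch below is illustrative only: the transmit power of 23 dBm and the two example distances are assumptions (no value of P_tx is given in this section), and the function names are hypothetical.

```python
import math

B_RB_HZ = 15e3 * 12        # RB bandwidth: 15 kHz x 12 subcarriers = 180 kHz
T_SLOT_S = 1e-3            # slot duration
F_C_GHZ = 3.5              # carrier frequency
NOISE_FIGURE_DB = 7.0      # N_F
G_DAP_DB = 12.0            # SNR bonus on the DAP path
ETA_MAX = 5.55             # spectral-efficiency cap, bit/s/Hz
PACKET_BITS = 200          # L

def path_loss_db(d_m):
    """Eq. (2), with d in metres."""
    return 32.4 + 21 * math.log10(d_m) + 20 * math.log10(F_C_GHZ)

def noise_power_dbm():
    """Eq. (4)."""
    return -174 + 10 * math.log10(B_RB_HZ) + NOISE_FIGURE_DB

def snr_db(p_tx_dbm, d_m, gain_db=0.0, shadowing_db=0.0):
    """Eq. (3); shadowing_db is one realisation of chi_i."""
    return p_tx_dbm + gain_db - path_loss_db(d_m) - shadowing_db - noise_power_dbm()

def rate_bps(gamma_db):
    """Eq. (5): Shannon rate on one RB, capped at ETA_MAX."""
    return B_RB_HZ * min(math.log2(1 + 10 ** (gamma_db / 10)), ETA_MAX)

def delivered_in_one_slot(gamma_db):
    """Delivery condition r_i * T_slot >= L."""
    return rate_bps(gamma_db) * T_SLOT_S >= PACKET_BITS

# Smallest SNR that still carries L bits in one slot on one RB.
min_snr_db = 10 * math.log10(2 ** (PACKET_BITS / (B_RB_HZ * T_SLOT_S)) - 1)

P_TX_DBM = 23.0  # ASSUMPTION: not specified in this section
gamma_direct = snr_db(P_TX_DBM, d_m=900.0)                  # meter far from the BS
gamma_dap = snr_db(P_TX_DBM, d_m=100.0, gain_db=G_DAP_DB)   # same meter, near a DAP
print(round(noise_power_dbm(), 2), round(min_snr_db, 2))
```

With these constants, a 200-bit packet needs a spectral efficiency of 200/180 ≈ 1.11 bit/s/Hz, which corresponds to an SNR of about 0.65 dB. In the absence of shadowing both example links sit far above that threshold, so in this sketch a delivery failure would come from a deep shadowing realisation or from not being allocated an RB, not from the mean link budget.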

3.3. Traffic and Congestion Cycles

Each meter generates packets according to a Bernoulli process. In slot t, meter i generates a new packet, if it has none pending, with probability

λ_{i} (t) = λ_{i}^{0} μ_{δ (i)} (t),

(6)

where

λ_{i}^{0} \sim U [λ_{min}, λ_{max}]

is a per-meter base rate fixed at episode start and

μ_{d} (t)

is the time-varying congestion multiplier of DAP d. To model the periodic synchronization events typical of AMI, such as scheduled meter-reading windows, each DAP has a phase-offset congestion cycle

μ_{d} (t) = 1 + (m - 1) sin (π \frac{(t - ϕ_{d}) mod T_{c}}{w}) 1 [(t - ϕ_{d}) mod T_{c} < w],

(7)

where

T_{c}

is the cycle period, w the peak width, m the peak multiplier, and

ϕ_{d} = (d - 1) T_{c} / D

the phase offset of DAP d. Because the phase offsets are spread across the period, different DAPs reach their congestion peaks at different times, so no single static rule is optimal for all DAPs simultaneously.
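The shape of the multiplier in (7) is easiest to see numerically. The sketch below evaluates it for each DAP; the values of T_c, w, and m are illustrative placeholders (none are given in this section), and the function name is hypothetical.

```python
import math

def congestion_multiplier(t, d, num_daps, period, width, peak):
    """Eq. (7): a half-sine bump of height `peak` lasting `width` slots,
    repeating every `period` slots and shifted by phi_d = (d - 1) * period / D.
    DAP indices d are 1-based, as in the text."""
    phi = (d - 1) * period / num_daps
    x = (t - phi) % period              # position within the current cycle
    if x < width:                       # indicator 1[(t - phi_d) mod T_c < w]
        return 1.0 + (peak - 1.0) * math.sin(math.pi * x / width)
    return 1.0                          # outside the peak window: base rate

# ASSUMPTION: illustrative parameters, not taken from the source.
T_C, W, M_PEAK, D = 600, 60, 3.0, 6

# Slot at which each DAP reaches its maximum: phi_d + w / 2.
peaks = {d: (d - 1) * T_C / D + W / 2 for d in range(1, D + 1)}
for d, t_peak in peaks.items():
    print(d, t_peak, congestion_multiplier(t_peak, d, D, T_C, W, M_PEAK))
```

Under these placeholder values the six peaks fall 100 slots apart, so while DAP 1 is at its maximum, every other DAP is still at the base rate. A rule tuned for one region's peak is therefore mistimed for the others.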

3.4. Two-Layer Resource Management

The resource manager operates on two timescales, illustrated in Figure 1. At the scheduling layer (every slot), a token-weighted scheduler assigns the available RBs within each cluster (the direct group at the BS, or the meters attached to a given DAP). At the association layer (every

T_{e}

slots, the epoch length), each meter is assigned a transmission mode. The scheduling layer is fixed across all experiments. The association layer is where classical heuristics and learned policies compete. The direct group is served by

C_{BS}

RBs and the DAP backhaul by

C_{DAP}

RBs, each backhaul RB carrying up to A aggregated packets.

4. Problem Formulation

4.1. Optimization Problem

Before casting the task as a sequential decision process, we state the underlying optimization problem. Let

M = {1, \dots, N}

index the smart meters and

D = {1, \dots, D}

the DAP regions, observed over epochs

k = 1, \dots, K_{ep}

. The decision variable is the per-meter association mode

a_{i}^{(k)} \in {0, 1}, i \in M,

(8)

where 0 denotes the direct-to-BS path and 1 the DAP path at epoch k. Let

{dlv}_{i}^{(k)} \in {0, 1}

indicate whether meter i’s packet is delivered within its deadline at epoch k. The objective is to maximize deadline-compliant delivery while limiting unnecessary mode churn:

max_{{a_{i}^{(k)}}} \sum_{k} \sum_{i \in M} {dlv}_{i}^{(k)} - w_{s} \sum_{k} \sum_{i \in M} |a_{i}^{(k)} - a_{i}^{(k - 1)}|,

(9)

subject to the resource and buffer constraints of the system model:

\sum_{i : a_{i}^{(k)} = 0} 1 [i scheduled on BS] \leq C_{BS}, \quad \forall k,

(10)

\sum_{i : a_{i}^{(k)} = 1, δ (i) = d} 1 [i served by DAP d] \leq A C_{d}^{bh}, \quad \forall d, \forall k,

(11)

β_{d}^{(k)} \leq β_{max}, \quad \forall d, \forall k,

(12)

where

C_{BS}

is the number of resource blocks available for the direct path,

C_{d}^{bh}

the backhaul resource blocks at DAP d, A the aggregation factor (packets carried per backhaul resource block),

β_{d}^{(k)}

the buffer occupancy of DAP d,

β_{max}

the buffer capacity, and

δ (i)

the nearest-DAP region of meter i. Here,

1 [\cdot]

denotes the indicator function, equal to 1 when its argument is true and 0 otherwise, so that each constraint sum counts the number of meters actually scheduled on the corresponding path in epoch k. The term weighted by

w_{s}

in (9) penalizes mode switching, matching the reward in (13). The decision is made online under uncertainty about future arrivals, channel realizations, and congestion phases, which motivates the sequential MDP formulation that follows.

Although the static statement in (9)–(12) is an integer linear program, this formulation is a single-epoch, full-information snapshot of a problem that is in fact sequential, stochastic, and online, so it cannot be discharged by a standard ILP solver acting as the runtime controller. Three properties of the system are responsible. First, the objective is not known at decision time: whether a packet meets its deadline,

{dlv}_{i}^{(k)}

, depends on future arrivals and channel realizations that are revealed only after the association decision is taken, whereas a solver requires the full instance data in advance. Second, the constraints couple decisions across epochs through the DAP buffers: the occupancy

β_{d}^{(k)}

in (12) is a function of all earlier associations, since packets queue and drain over time, so the problem is a multi-stage stochastic program rather than a static one; solving it exactly as an integer program would require unrolling the entire horizon with all future arrivals and channels known, which is not available online. Third, even a myopic per-epoch relaxation would require re-solving an integer program with up to N binary variables every epoch (5 ms) and would still ignore the buffer and congestion-cycle dynamics that couple epochs, which is precisely the limitation of the fixed heuristics we compare against. The learning approach instead amortizes the optimization: the cost is paid once during training, and at runtime, each decision is a single forward pass that is independent of N and requires no solver in the control loop. An exact integer program is therefore useful as an offline, hindsight performance bound, but it is not a deployable online controller for this problem; this is the gap that the proposed framework is designed to fill.

4.2. Markov Decision Process

We model dynamic association as a Markov decision process (MDP)

(S, A, P, r, γ)

observed at the epoch timescale. At epoch k, the global state

s_{k} \in S

comprises, for each DAP d, the normalized buffer occupancy

b_{d} = β_{d} / β_{max}

(where

β_{d}

is the number of buffered packets and

β_{max}

the buffer capacity) and the cycle phase position

ψ_{d} = ((t - ϕ_{d}) mod T_{c}) / T_{c}

; the normalized BS load; the mean stale SNR; and, for each meter in the currently controlled subset, its stale SNR on both paths, nearest-DAP buffer, deadline pressure, and current mode.

The reward at epoch k balances delivery, deadline compliance, and switching cost:

r_{k} = w_{p} {PDR}_{k} - w_{m} {MISS}_{k} - w_{s} \frac{n_{k}^{sw}}{| U_{k} |},

(13)

where

{PDR}_{k}

is the fraction of packets delivered during the epoch,

{MISS}_{k}

the fraction that exceeded their deadline,

n_{k}^{sw}

the number of meters that changed mode, and

U_{k}

the controlled subset. The weights are

w_{p} = 1.0

,

w_{m} = 1.5

,

w_{s} = 0.02

.
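The reward in (13) is simple enough to evaluate by hand, and doing so shows how the three weights trade off. The sketch below is a minimal illustration with a hypothetical function name; it takes the epoch's PDR and miss fraction as given and counts mode switches over the controlled subset.

```python
W_P, W_M, W_S = 1.0, 1.5, 0.02   # weights from the text

def epoch_reward(pdr, miss, prev_modes, new_modes,
                 w_p=W_P, w_m=W_M, w_s=W_S):
    """Eq. (13): r_k = w_p * PDR_k - w_m * MISS_k - w_s * n_sw / |U_k|.
    prev_modes / new_modes hold the 0/1 modes of the controlled subset U_k."""
    if len(prev_modes) != len(new_modes) or not new_modes:
        raise ValueError("mode lists must be non-empty and of equal length")
    n_sw = sum(a != b for a, b in zip(prev_modes, new_modes))
    return w_p * pdr - w_m * miss - w_s * n_sw / len(new_modes)

# Ten controlled meters, one of which changes mode this epoch.
prev = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
new = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1]
r = epoch_reward(pdr=0.90, miss=0.05, prev_modes=prev, new_modes=new)
print(round(r, 3))
```

With these weights, a one-point increase in the miss fraction costs as much as a 1.5-point drop in PDR, whereas switching every controlled meter in the same epoch costs only 0.02. The switching term therefore discourages churn without overriding a delivery gain.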

4.3. The Per-Meter Action Space and Its Intractability

A direct formulation assigns each meter i a binary mode

a_{i} \in {0, 1}

, where 0 denotes the direct path and 1 the DAP path. The joint action is

a = (a_{1}, \dots, a_{N}) \in {0, 1}^{N}

, so the action space has cardinality

| A_{meter} | = 2^{N} .

(14)

For

N = 1500

this is

2^{1500}

, a number with more than 450 decimal digits. A policy gradient learner receives one scalar reward per joint action and must estimate gradients over this space from sampled returns. The credit-assignment problem is severe: the contribution of any single

a_{i}

to the reward is masked by the other

N - 1

decisions, so the variance of the gradient estimate is large, and learning does not converge to a useful policy. This is the failure mode that motivates the reformulation below.

4.4. Regional Strategy Composition

We use “hierarchical” here to describe the action structure: a high-level regional action is deterministically expanded into low-level per-meter modes. The low level is a fixed library of interpretable rules, not a second learned policy, so the method is a hierarchical-action controller (equivalently, an online hyper-heuristic), and not an options- or feudal-style hierarchical RL agent. This design choice trades the expressiveness of a learned low level for interpretability and a compact, tractable action space. We introduce a hierarchical action in which the agent does not choose modes directly but chooses, for each DAP region d, a strategy index

g_{d} \in {0, \dots, K - 1}

from a library of K parameterized rules. The regional action is

g = (g_{1}, \dots, g_{D}) \in {0, \dots, K - 1}^{D}

, with cardinality

| A_{region} | = K^{D} .

(15)

For

K = 5

strategies and

D = 6

DAPs this is

5^{6} = 15,625

, a tractable space for both discrete (DQN) and policy gradient (PPO) learners. A deterministic mapping Φ then expands the regional action into per-meter modes,

a = Φ (g, s_{k}), a_{i} = h_{g_{δ (i)}} (i, s_{k}),

(16)

where

h_{g} (\cdot)

is the meter-level rule of strategy g applied with the local state of meter i. Each meter is assigned to the region of its nearest DAP

δ (i)

and receives the strategy chosen for that region. Two properties follow directly from (16). First, the action space is reduced from

2^{N}

to

K^{D}

, a reduction by a factor of

2^{N} / K^{D}

that for our parameters exceeds

10^{440}

. Second, because Φ acts per meter using the local state, every meter still receives an individualized decision; the reduction is in the policy’s action space, not in control granularity.
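The cardinalities in (14) and (15) and the stated reduction factor can be checked with exact integer arithmetic. This is a small verification sketch using the parameter values from the text; the variable names are our own.

```python
N, K, D = 1500, 5, 6            # meters, strategies, DAP regions (from the text)

per_meter_actions = 2 ** N      # |A_meter|, Eq. (14)
regional_actions = K ** D       # |A_region|, Eq. (15)

digits = len(str(per_meter_actions))   # decimal digits of 2^1500

# Order of magnitude of the reduction factor 2^N / K^D (integer division).
reduction_digits = len(str(per_meter_actions // regional_actions))

print(regional_actions, digits, reduction_digits)
```

The regional space has 15,625 joint actions, small enough to enumerate, whereas the per-meter space has 452 decimal digits. The ratio between the two has 448 digits, consistent with the bound of more than 10^440 quoted above.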

The key consequence is representational. A single fixed heuristic applies one rule

h_{g}

to all regions at all times. The regional policy can apply different rules to different regions in the same epoch, for example assigning a buffer-aware rule to lightly loaded DAPs and a cycle-aware rule to DAPs approaching their congestion peak. The set of policies expressible by regional composition strictly contains the set of single fixed heuristics, so the optimal regional policy is at least as good as the best single heuristic, and strictly better whenever region-specific rule selection helps.

Algorithm 1 details the deterministic expansion Φ that converts the regional action

g

into per-meter modes. For each meter, the strategy assigned to its nearest DAP region is applied using only that meter’s local state, so the expansion is

O (N)

per epoch and adds negligible overhead to the control loop.

Algorithm 1 Regional-to-meter expansion $Φ (g, s)$
Require: regional action $g = (g_{1}, \dots, g_{D})$ , state s
Ensure: per-meter modes $a = (a_{1}, \dots, a_{N})$
1: for each meter $i \in M$ do
2: $d \leftarrow δ (i)$	▹ index of nearest DAP
3: $g \leftarrow g_{d}$	▹ strategy chosen for that region
4: $a_{i} \leftarrow h_{g} (i, s)$	▹ apply rule with local state
5: end for
6: return $a$
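Algorithm 1 translates almost line for line into code. The sketch below is one possible rendering, not the authors' implementation: the state layout (a dictionary with per-region buffers), the two toy rules, and the 0.5 threshold are assumptions made for the example, and indices are 0-based rather than 1-based.

```python
from typing import Callable, Dict, List, Sequence

Rule = Callable[[int, Dict], int]      # h_g(i, s) -> mode in {0, 1}

def expand(g: Sequence[int], state: Dict,
           nearest_dap: Sequence[int], rules: Sequence[Rule]) -> List[int]:
    """Phi(g, s): map one strategy index per region to one mode per meter.
    A single pass over the meters, so the cost is O(N) per epoch."""
    modes = []
    for i, d in enumerate(nearest_dap):    # d = delta(i), 0-based here
        rule = rules[g[d]]                 # strategy chosen for region d
        modes.append(rule(i, state))       # apply it with meter i's local state
    return modes

# Toy setup: 4 meters, 2 regions, 2 rules (all values hypothetical).
nearest = [0, 0, 1, 1]
state = {"buffer": [0.2, 0.3], "nearest_dap": nearest}

def always_direct(i, s):
    return 0

def buffer_aware(i, s):
    # ASSUMPTION: 0.5 is a placeholder threshold.
    return int(s["buffer"][s["nearest_dap"][i]] < 0.5)

rules = [always_direct, buffer_aware]
same_rule_everywhere = expand((1, 1), state, nearest, rules)
mixed_rules = expand((0, 1), state, nearest, rules)
print(same_rule_everywhere, mixed_rules)
```

The second call applies a different rule in each region within the same epoch, which is the case a single fixed heuristic cannot express.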

4.5. Deployment Architecture and Information Model

The controller is centralized at the base station or an adjacent edge node, which is the natural location in an AMI deployment because the BS already terminates the uplink of both the direct meters and the DAP backhaul. The state consumed by the policy is aggregated at the epoch timescale (5 ms in our setup): per-region buffer occupancy and cycle phase are reported by the DAPs over the backhaul, and the per-meter channel and queue quantities are already available at the BS scheduler from uplink control signaling. The policy therefore does not require any new measurement infrastructure beyond what a scheduling BS already collects, and the signaling it adds is limited to per-region summaries exchanged once per epoch, not per slot or per meter. A centralized controller and a hierarchical action are not in tension. The hierarchy here is not a mechanism for distributing computation across agents; it is a mechanism for reducing the size of the action space that a single centralized policy must search. Even with full central information, learning directly over the

2^{N}

per-meter action space is intractable (Section 4); the regional action makes the centralized learning problem solvable while the deterministic expansion preserves per-meter granularity. The benefit of the hierarchy is thus representational rather than architectural, and it applies precisely because the controller is centralized.

5. Methodology

5.1. Strategy Library

The library contains

K = 5

interpretable strategies, each a meter-level rule

h_{g} (i, s)

returning a mode in

{0, 1}

.

Direct-priority ( $g = 0$ ): assign the direct path, $h_{0} = 0$ . Useful when the region’s DAP is congested or unreliable.
DAP-priority ( $g = 1$ ): assign the DAP path when the link is adequate and the buffer is not critical, $h_{1} = 1 [γ_{i}^{DAP} > γ_{th} \land b_{δ (i)} < b_{hi}]$ .
Buffer-aware ( $g = 2$ ): use the DAP only if its buffer is safe, $h_{2} = 1 [b_{δ (i)} < b_{th}]$ .
Cycle-aware ( $g = 3$ ): avoid the DAP during or shortly before its congestion peak, $h_{3} = 1 [ψ_{δ (i)} \notin P \land b_{δ (i)} < b_{hi}]$ , where $P$ is the peak-and-approach phase window.
Deadline-aware ( $g = 4$ ): for urgent packets, choose the path with the higher immediate delivery probability given stale CSI and buffer state; otherwise, default to buffer-aware behavior.

These rules are the same logic used by the corresponding standalone baselines, so the regional policy composes exactly the heuristics against which it is compared. This makes the comparison fair: any advantage of the learned policy comes from selecting among the rules per region and per epoch, not from access to a rule that the baselines lack.
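The five rules above can be sketched as plain functions of a per-meter view of the state. This is a minimal illustration rather than the released implementation: the threshold values, the `MeterView` fields, and the two delivery-probability estimates used by the deadline-aware rule are assumptions introduced here, since this section does not specify them.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the calibrated values are not stated in this section.
GAMMA_TH = 5.0  # minimum acceptable DAP-link quality (assumed units: dB)
B_TH = 0.6      # "safe" DAP buffer occupancy (assumed, fraction of capacity)
B_HI = 0.9      # "critical" DAP buffer occupancy (assumed, fraction of capacity)


@dataclass
class MeterView:
    """What a rule sees for meter i (field names are illustrative)."""
    sinr_dap: float   # gamma_i^DAP, quality of the meter-to-DAP link
    buffer: float     # b_{delta(i)}, occupancy of the meter's DAP buffer
    in_peak: bool     # True if the DAP's cycle phase lies in the window P
    urgent: bool      # True if the head-of-line packet is close to its deadline
    p_direct: float   # assumed estimate of immediate delivery via the BS
    p_dap: float      # assumed estimate of immediate delivery via the DAP


def direct_priority(m: MeterView) -> int:
    return 0  # always the direct path


def dap_priority(m: MeterView) -> int:
    return int(m.sinr_dap > GAMMA_TH and m.buffer < B_HI)


def buffer_aware(m: MeterView) -> int:
    return int(m.buffer < B_TH)


def cycle_aware(m: MeterView) -> int:
    return int((not m.in_peak) and m.buffer < B_HI)


def deadline_aware(m: MeterView) -> int:
    if m.urgent:
        return int(m.p_dap > m.p_direct)  # pick the more promising path
    return buffer_aware(m)                # otherwise fall back to buffer-aware


# Index g selects the rule; mode 0 is the direct path, mode 1 the DAP path.
STRATEGIES = [direct_priority, dap_priority, buffer_aware, cycle_aware, deadline_aware]


def h(g: int, m: MeterView) -> int:
    return STRATEGIES[g](m)


example = MeterView(sinr_dap=10.0, buffer=0.7, in_peak=False,
                    urgent=False, p_direct=0.5, p_dap=0.4)
modes = [h(g, example) for g in range(len(STRATEGIES))]
```

For the example meter, whose DAP buffer is above the assumed safe threshold but below the critical one, the DAP-priority and cycle-aware rules choose the DAP path while the other three choose the direct path, which illustrates how the same local state maps to different modes under different regional choices.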

5.2. Learning Algorithms

We train two controllers under the regional formulation.

PPO. Proximal Policy Optimization optimizes a clipped surrogate objective

$L^{CLIP} (θ) = \mathbb{E}_{k} [\min (ϱ_{k} (θ) \hat{A}_{k}, \mathrm{clip} (ϱ_{k} (θ), 1 - ϵ, 1 + ϵ) \hat{A}_{k})]$ , (17)

where $ϱ_{k} (θ) = π_{θ} (g_{k} ∣ s_{k}) / π_{θ_{old}} (g_{k} ∣ s_{k})$ is the probability ratio, $\hat{A}_{k}$ is the generalized advantage estimate, and $ϵ = 0.15$ is the clip range. The policy is a multilayer perceptron with two hidden layers of 256 units that outputs D independent categorical distributions, one per region (a MultiDiscrete action). We use a learning rate that decays linearly from $3 \times 10^{-4}$ to $1 \times 10^{-5}$, a rollout length of 2048 steps, a batch size of 128, 8 epochs per update, discount $γ = 0.99$, and generalized-advantage parameter 0.95.
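As a minimal sketch of the clipped surrogate in (17), the function below evaluates the objective on a batch of log-probabilities and advantage estimates using only the standard library. The function name and argument layout are illustrative assumptions; a training implementation would compute the same quantity on tensors and maximize it by gradient ascent.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.15):
    """Batch average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # probability ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip to [1-eps, 1+eps]
        terms.append(min(ratio * adv, clipped * adv))     # pessimistic bound
    return sum(terms) / len(terms)


# Two samples: the ratio doubles with a positive advantage, and halves with a
# negative one. Both terms are clipped, giving (1.15 - 0.85) / 2 = 0.15.
value = ppo_clip_objective(
    logp_new=[math.log(2.0), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

The example shows the role of the clip range: once the ratio leaves the interval from 0.85 to 1.15, the objective stops rewarding further movement of the policy in that direction.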

DQN. The Deep Q-Network learns the action-value function $Q (s, a)$ by minimizing the temporal-difference loss

$L (θ) = \mathbb{E} [(r + γ \max_{a'} Q_{θ^{-}} (s', a') - Q_{θ} (s, a))^{2}]$ , (18)

where $θ^{-}$ are the target-network parameters. Because DQN requires a single discrete action, we encode the regional choice as one of a set of predefined regional patterns, so that each discrete action expands to a region-wise strategy assignment. The learning rate decays linearly from $5 \times 10^{-4}$ to $5 \times 10^{-5}$, the replay buffer holds $10^{5}$ transitions, the batch size is 128, and the exploration fraction is 0.40 of training.
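The temporal-difference loss in (18) can be illustrated on a small batch as follows. This is a sketch under stated assumptions: `q_online` and `q_target` are hypothetical callables returning a list of action values for a state, the terminal-state handling via a `done` flag is the usual convention rather than something stated in (18), and the discount used for DQN is assumed to be passed in by the caller.

```python
def td_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over (s, a, r, s_next, done) transitions."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # Bootstrap from the target network unless the episode has ended.
        bootstrap = 0.0 if done else gamma * max(q_target(s_next))
        td_error = r + bootstrap - q_online(s)[a]
        total += td_error ** 2
    return total / len(batch)


# Toy tabular stand-ins for the two networks (values are illustrative only).
online_table = {0: [1.0, 2.0], 1: [0.0, 0.0]}
target_table = {0: [0.0, 0.0], 1: [3.0, 4.0]}

loss = td_loss(
    batch=[(0, 1, 1.0, 1, False)],
    q_online=lambda s: online_table[s],
    q_target=lambda s: target_table[s],
    gamma=0.5,
)
```

In the toy transition the target is 1.0 + 0.5 × 4.0 = 3.0 and the online estimate is 2.0, so the squared error is 1.0.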

For both algorithms, observations are normalized online with a running mean and variance, and rewards are scaled likewise. The full training and evaluation pipeline, environment, and configuration are released for reproducibility.
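Online normalization with a running mean and variance can be sketched with Welford's update, shown here for a scalar stream. The class name and the small `eps` constant are assumptions made for illustration; an observation vector would keep one such estimate per dimension, and the released pipeline may differ in detail.

```python
import math


class RunningNorm:
    """Running mean and (population) variance via Welford's algorithm."""

    def __init__(self, eps=1e-8):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0    # running sum of squared deviations from the mean
        self.eps = eps   # guards against division by zero

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / self.n if self.n > 0 else 1.0

    def normalize(self, x):
        return (x - self.mean) / math.sqrt(self.var + self.eps)


norm = RunningNorm()
for sample in [1.0, 2.0, 3.0, 4.0]:
    norm.update(sample)
# After these four samples the mean is 2.5 and the variance is 1.25.
```

Because the statistics are updated incrementally, no history of past observations needs to be stored, which suits a controller that sees one state per epoch.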

Algorithm 2 summarizes the epoch-level control loop shared by both controllers. At each epoch, the agent observes the global state, selects a regional action, expands it to per-meter modes via Algorithm 1, and the environment runs $T_{e}$ slots of token-weighted scheduling before returning the reward of (13).

Algorithm 2 Regional association control loop (one episode)
Require: trained or learning policy $π_{θ}$ , epoch length $T_{e}$ , episode length T
1: initialize meter modes, DAP buffers, packet queues
2: for epoch $k = 0, 1, \dots, T / T_{e} - 1$ do
3: observe global state $s_{k}$
4: sample regional action $g_{k} \sim π_{θ} (\cdot ∣ s_{k})$
5: $a_{k} \leftarrow Φ (g_{k}, s_{k})$	▹ Algorithm 1
6: apply modes $a_{k}$ ; count switches $n_{k}^{sw}$
7: for slot $= 1, \dots, T_{e}$ do
8: generate arrivals with rate $λ_{i} (t)$
9: serve direct group with $C_{BS}$ RBs (token-weighted)
10: enqueue DAP-mode packets to nearest-DAP buffers
11: flush DAP buffers via $C_{DAP}$ backhaul RBs
12: drop packets exceeding deadline $T_{dl}$
13: end for
14: compute reward $r_{k}$ by (13)
15: if training then
16: store transition $(s_{k}, g_{k}, r_{k}, s_{k + 1})$ ; update $θ$
17: end if
18: end for
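The structure of Algorithm 2 can be sketched as a short driver function. The environment interface used here (`reset`, `apply_modes`, `step_slot`, `end_epoch`) and the `learner` hook are hypothetical names introduced for illustration, and the stub environment only counts calls; it does not model arrivals, resource blocks, or deadlines.

```python
def run_episode(policy, expand, env, T=240, T_e=5, learner=None):
    """Run one episode of the epoch-level loop and return per-epoch rewards."""
    rewards = []
    s = env.reset()                       # line 1: initialize modes, buffers, queues
    for _ in range(T // T_e):             # line 2: one iteration per epoch
        g = policy(s)                     # lines 3-4: regional action, one rule per region
        a = expand(g, s)                  # line 5: per-meter modes (role of Algorithm 1)
        env.apply_modes(a)                # line 6: apply modes, count switches
        for _ in range(T_e):              # lines 7-13: slot-level scheduling
            env.step_slot()
        r, s_next = env.end_epoch()       # line 14: epoch reward and next state
        if learner is not None:           # lines 15-17: training only
            learner(s, g, r, s_next)
        rewards.append(r)
        s = s_next
    return rewards


class CountingEnv:
    """Stub environment that only counts slots and epochs."""

    def __init__(self):
        self.slots = 0
        self.epochs = 0

    def reset(self):
        self.slots = 0
        self.epochs = 0
        return 0

    def apply_modes(self, modes):
        self.last_modes = modes

    def step_slot(self):
        self.slots += 1

    def end_epoch(self):
        self.epochs += 1
        return 0.0, self.epochs


env = CountingEnv()
rewards = run_episode(policy=lambda s: [2] * 6, expand=lambda g, s: g, env=env)
```

With an episode length of 240 slots and an epoch length of 5 slots, the loop makes 48 regional decisions per episode, each followed by 5 scheduling slots.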

6. Experimental Setup

6.1. Calibration

Table 1 lists the simulator parameters. The RB pools and deadlines are calibrated so that the strongest baseline operates in a discriminating regime, neither saturated near 100% nor collapsed, which is where differences among policies are meaningful. The mMTC slice provides $C_{BS} = 20$ direct RBs and $C_{DAP} = 8$ backhaul RBs with aggregation factor $A = 4$; the deadline is $T_{dl} = 25$ slots, the epoch length is $T_{e} = 5$ slots, and the episode length is 240 slots, spanning four full congestion cycles.

The physical-layer parameters (carrier frequency, bandwidth, noise figure, and path-loss model) follow the 3GPP NR system parameters and the urban macro path-loss model of 3GPP TR 38.901 [43]. The smart-meter traffic parameters (packet size of 200 bytes, device population on the order of $10^{3}$ per cell, and the Poisson/Beta arrival abstraction) follow the machine-type-communication traffic model of 3GPP TR 37.868 [44], which is the standard reference for mMTC evaluation; the use of a synthetic parametric arrival model in place of field traces is consistent with the learning-based mMTC scheduling literature [15,45].

6.2. Scenarios

Three traffic regimes test different stress modes.

S1 (baseline). Nominal arrival rates and cycles. A control regime in which the system is comfortably provisioned.
S2 (high load). Arrival rates are amplified network-wide, stressing both transmission paths so that no single rule dominates.
S3 (bursty cycle). Sharper, more concentrated congestion peaks, so that the timing of DAP usage matters more, and the prediction of the cycle is more valuable.

6.3. Baselines

We compare against eight meter-level policies that operate at full per-meter granularity over all 1500 meters. DirectOnly and DAPOnly are the two extreme fixed policies. DistThreshold attaches a meter to its DAP if it lies within a distance threshold. CycleAware, DeadlineAware, and BufferAware apply the corresponding single rules from the strategy library of Section 5 globally: BufferAware routes a meter to its DAP only when that DAP’s buffer occupancy is below a safe threshold, and DeadlineAware routes urgent packets to the path with the higher immediate delivery probability. HybridMeter is a hand-crafted policy that combines these two rules with the cycle rule in a fixed precedence: it first applies the buffer-occupancy check, then overrides toward the direct path when the DAP is in or approaching a congestion peak (the cycle rule), and finally prioritizes packets that are close to their deadline (the deadline rule). Unlike the regional controller, this precedence is fixed and identical across all DAP regions. OracleCycle is a strong cycle-aware reference. The learned controllers PPO and DQN use the regional formulation.

6.4. Protocol and Statistics

Each configuration is run for five random seeds $\{42, 123, 777, 2024, 31415\}$, with 10 evaluation episodes per seed. PPO is trained for $3 \times 10^{5}$ steps and DQN for $1.5 \times 10^{5}$ steps. We report the mean PDR and deadline-miss rate with standard deviation, and we test significance with a Welch t-test over episodes, a Mann–Whitney U test, a paired t-test over per-seed means, and Cohen’s d effect size.
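Three of the statistics named above can be computed with the standard library alone, as the following sketch shows. The pooled-standard-deviation form of Cohen's d is an assumption (the variant used in the study is not stated here), the input numbers are made up purely to exercise the functions, and p-values would additionally require the t-distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = stdev(x) ** 2 / len(x)
    vy = stdev(y) ** 2 / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def paired_t(x, y):
    """Paired t statistic over matched samples, e.g. per-seed means."""
    diffs = [a - b for a, b in zip(x, y)]
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up numbers, not results from the study.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
t_w, df_w = welch_t([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
t_p, df_p = paired_t([2.0, 4.0, 7.0], [1.0, 3.0, 5.0])
```

The paired test operates on one difference per seed, so with five seeds it has four degrees of freedom, whereas the Welch test over episodes treats the 50 evaluation episodes per policy as the samples.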

7. Analysis of Results

7.1. Overall Comparison

Table 2 reports PDR for all policies in all three scenarios. The two extreme fixed policies (DirectOnly, DAPOnly) deliver the lowest PDR, confirming that neither path alone is adequate: DirectOnly reaches only 0.497–0.574 and DAPOnly 0.533–0.621 across scenarios, because the direct path is resource-limited and the DAP-only path overloads buffers. Among the heuristics, BufferAware is consistently strongest, reaching 0.937 in S1, 0.829 in S2, and 0.816 in S3.

The learned PPO controller exceeds BufferAware in every scenario, by +0.63 percentage points (pp) in S1, +2.41 pp in S2, and +2.66 pp in S3. DQN also exceeds BufferAware in S2 and S3 and matches it in S1. Both learned controllers also reduce the deadline-miss rate in every scenario. Figure 2 shows this comparison graphically, where the two fixed extremes are clearly inadequate and the learned controllers sit above the strongest heuristic in every scenario.

7.2. Statistical Significance

Table 3 reports the significance of the PPO and DQN advantage over BufferAware. The paired t-test over per-seed means, which respects the seed-level independence structure, yields $p < 0.001$ for PPO in all three scenarios. The Cohen’s d effect sizes are large to very large, reaching $d = 5.12$ for PPO in S3. The 95% confidence intervals on the PDR difference exclude zero in every case, and the per-seed direction of the gap is positive for all five seeds in all scenarios.

7.3. Deadline-Miss Rate

A reduction in deadline misses accompanies the improvement in PDR. Figure 3 reports the mean deadline-miss rate for every policy in the three scenarios. PPO reduces the miss rate relative to BufferAware by 0.75 pp in S1, 2.69 pp in S2, and 2.92 pp in S3, all significant at $p < 0.001$. The joint improvement in delivery and deadline compliance indicates that the learned policy is not trading one objective for the other but improving both, consistent with the reward in (13).

7.4. The Advantage Grows with Stress

The PPO advantage over BufferAware increases monotonically with traffic difficulty: +0.63 pp in the comfortable baseline, +2.41 pp under high load, and +2.66 pp under bursty cycles. Both learned controllers reach this performance from converged training rather than from an under-trained transient: Figure 4 shows that the mean episode reward of PPO and DQN plateaus well before the training budget is exhausted. In the baseline regime BufferAware already attains 0.937 PDR, leaving little headroom, so the gain is small though still significant. As load and burstiness increase, no single rule is adequate everywhere, and the value of selecting different rules per region grows. This pattern is the central empirical finding: regional composition helps most precisely when the problem is hard enough that no fixed heuristic is globally optimal. The heat map in Figure 5 summarizes the full policy-by-scenario ordering, and the per-episode distributions in Figure 6 confirm that the learned advantage is consistent across episodes and seeds rather than driven by outliers.

7.5. Robustness and Sensitivity Analysis

To assess how the findings depend on the configuration, we vary four parameters: the strategy-library size K, the number of DAPs D, the cell radius, and the aggregation factor A. To contain training cost, the K sweep is run across all three traffic regimes, and the D, radius, and A sweeps are run on the baseline scenario with three seeds and the same training budget as the main experiments. Figure 7 summarizes the four sweeps.

7.5.1. Strategy-Library Size K

Increasing K from 3 to 4 improves PDR under load (for example, from 0.849 to 0.857 in the high-load regime), after which performance saturates: $K = 5$ gives 0.853, statistically indistinguishable from $K = 4$. The learned policy exceeds the strongest single heuristic at every K in every regime. This is the expected diminishing-returns behavior and justifies the library size: a handful of complementary rules captures most of the achievable gain, and adding further rules yields no measurable improvement. The K values used here are subsets of the existing five-rule library, so no new heuristics are introduced.

7.5.2. Number of DAPs D

At $D = 4$ and $D = 8$ on the baseline scenario, the learned controller holds its advantage over BufferAware (for example, 0.983 versus 0.977 at $D = 4$, and 0.978 versus 0.972 at $D = 8$), confirming the ordering is not an artifact of the six-DAP topology used in the main experiments.

7.5.3. Cell Radius

Varying the cell radius between 0.5 km and 1.5 km with the same meter count leaves the ordering unchanged; the learned controller remains at or above the strongest heuristic at both radii, indicating robustness to the path-loss regime.

7.5.4. Aggregation Factor A

The aggregation factor controls how many buffered packets each backhaul resource block can carry. As A increases from 2 to 4 to 8, PDR rises for all policies (from 0.75 to 0.94 to 0.99 on the baseline scenario), because additional backhaul capacity relieves the DAP bottleneck. The learned controller remains at or above the heuristics across the entire range. The constrained setting $A = 2$ is the most discriminative; at $A = 8$ the system is near saturation and all policies converge.

7.5.5. Action-Space Expressiveness

The reformulation reduces the joint action space from $2^{N}$ to $K^{D}$. For $N = 1500$, $D = 6$, the per-meter space has cardinality $2^{1500} \approx 10^{452}$, whereas the regional space ranges from $3^{6} = 729$ at $K = 3$ to $5^{6} = 15,625$ at $K = 5$ (Figure 8). The reduction is by more than 447 orders of magnitude, and the regional cardinality is independent of N: it depends only on K and D. The regional policy cannot express every one of the $2^{N}$ per-meter assignments; it can express any assignment that is realizable as a per-region choice among the K rules, expanded by Φ using each meter’s local state. Since Φ applies a rule per meter using that meter’s own buffer, deadline, and channel state, the regional policy retains per-meter differentiation while searching a space small enough for stable learning. The empirical K sweep above shows that this restricted but tractable space is sufficient to exceed every fixed heuristic.
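The cardinalities quoted above follow from short arithmetic, which the sketch below reproduces. The function names are illustrative; the per-meter space is handled through its base-10 logarithm to keep the numbers readable.

```python
import math


def log10_per_meter_space(n_meters):
    """log10 of 2**N, the number of per-meter binary assignments."""
    return n_meters * math.log10(2)


def regional_space(k, d):
    """K**D, the number of regional strategy assignments."""
    return k ** d


N, D = 1500, 6
gaps = {}
for K in (3, 4, 5):
    size = regional_space(K, D)
    # Orders of magnitude separating the two action spaces.
    gaps[K] = log10_per_meter_space(N) - math.log10(size)
    print(K, size, round(gaps[K], 1))
```

Here $\log_{10} 2^{1500} \approx 451.5$ and $\log_{10} 15,625 \approx 4.2$, so the gap at $K = 5$ is about 447.4 orders of magnitude, consistent with the figure of more than 447 stated above; at $K = 3$ it is about 448.7.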

7.6. What the Policy Learned

Inspection of the regional action distribution, shown in Figure 9, reveals that PPO does not collapse to a single rule. Across scenarios, it allocates the largest share to DAP-priority (39–45% of regional decisions) while retaining substantial use of cycle-aware (13–16%), deadline-aware (14–24%), buffer-aware (13–17%), and direct-priority (7–13%) strategies. Under the bursty-cycle scenario, the deadline-aware share rises relative to the baseline, consistent with the greater deadline pressure that sharp peaks create. The policy thus composes rules in a state-dependent way rather than imitating any single baseline.

7.7. Cost of the Learned Policy

The improvement carries two costs that we report for completeness. First, the learned controllers reassociate meters more often than the lightest heuristics: PPO performs on the order of $5 \times 10^{4}$ mode switches per episode, comparable to BufferAware and more than HybridMeter (≈ $3.5 \times 10^{4}$). Second, the inference time of the neural policy is higher than that of a rule: PPO requires on the order of $10^{3}$ microseconds per epoch decision versus tens of microseconds for the heuristics. Because association decisions are made once per epoch (5 ms), this inference cost is small relative to the decision interval and does not affect real-time operation, but it is a consideration for very large deployments. Figure 10 places each policy in the switching-cost versus PDR plane, showing that the learned controllers attain the highest delivery at a switching cost comparable to the strongest heuristic.

7.8. Scalability with Meter Count

A practical question is whether the formulation remains viable as the meter population grows beyond the 1500 meters of the main study. The central property is that the regional action space is $K^{D}$, which is independent of N: it remains 15,625 at $N = 1500$, 3000, and 5000. We confirm that learning remains stable and delivery performance is maintained as N grows, provided the radio resources are provisioned to the offered load. With the direct-path and backhaul resource blocks and the DAP buffer scaled in proportion to N, the mean PDR is 0.943, 0.944, and 0.970 at $N = 1500$, 3000, and 5000, respectively (Figure 11), while the per-decision action space is unchanged and the training time grows only moderately, from 31 to 47 min on a single NVIDIA L4 GPU. The policy network has on the order of $1.8 \times 10^{5}$ parameters, so its memory footprint is a few megabytes and does not grow with N; the observation is composed of per-region aggregates plus a controlled-subset summary, whose dimensionality is set by D and the subset size rather than by N. The formulation therefore scales to larger meter populations without growth in the action space, the policy size, or the per-decision inference cost. Absolute delivery, as for any policy, depends on the ratio of offered load to provisioned capacity rather than on the controller; the point of this experiment is that the method’s decision complexity is invariant to N.

8. Discussion

8.1. Why Regional Composition Works

The result can be understood through the representational argument of Section 4. A single fixed heuristic is a special case of the regional policy in which the same rule is assigned to all regions. The regional policy, therefore, cannot do worse than the best single rule, and it does strictly better whenever the optimal rule differs across regions or over time. The phase-offset congestion cycles of (7) guarantee that such differences exist: at any moment, some DAPs are near their peak while others are quiet, so a globally applied rule is necessarily suboptimal for part of the network. The learned policy exploits exactly this structure, which is why its advantage is largest in the high-load and bursty-cycle regimes where the heterogeneity across regions is most pronounced.

8.2. Relation to the Literature

Our finding is consistent with the nuanced picture in the RL-for-wireless literature [17,23]: learned policies clearly beat weak heuristics, whereas against a strong engineered heuristic the margin is smaller but, with an appropriate action formulation, can be made positive and statistically robust. The mechanism, selecting among heuristics rather than replacing them, places our method in the hyper-heuristic and adaptive-operator-selection tradition [28,29,33], transplanted from combinatorial optimization to a spatially distributed networking control problem. The hierarchical action structure follows the action-space reduction principle of hierarchical RL [24,26].

8.3. Practical Implications

For AMI operators, the results suggest that dynamic association can be implemented as a lightweight regional controller layered on top of existing rule libraries, rather than as a meter-level learner that would be intractable to train. Because the controller selects among interpretable rules, its decisions remain auditable: an operator can read off which strategy is active in each region. The improvement is modest in comfortable conditions and largest under stress, which is the regime that matters most for service guarantees.

8.4. Operating-Point Analysis

The PDR margins reported above are averages, but an AMI operator provisions to a service-level target and cares about how often that target is met. The exact target is not universal: it depends on the traffic class and the utility’s reliability policy, and AMI deployments distinguish fixed-schedule from event-driven reporting with different requirements [12]. Rather than assume a single value, we therefore characterize the full relationship between a candidate target $θ$ and the resulting service-level compliance. For each $θ$, we measure the fraction of evaluation episodes in which a policy achieves $PDR \geq θ$, over the 50 episodes per policy in the high-load scenario. Figure 12 shows the result. The separation between the learned controllers and the heuristics is much larger in this view than the average margin suggests, and it holds across a range of targets rather than at a single point: for any target in the 0.82–0.86 band, PPO and DQN meet the target in a substantially larger fraction of episodes than BufferAware. For example, at $θ = 0.84$ the compliance fractions are 100% (PPO), 84% (DQN), and 30% (BufferAware). The practical reading is that a small average margin translates into a large difference in the probability of meeting a deployment target, whatever specific target a utility adopts within this range.
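The compliance measure described above is a thresholded fraction over per-episode PDR values, and can be sketched as follows. The function names are illustrative, and the PDR values in the example are made up solely to exercise the code; they are not results from the study.

```python
def compliance_fraction(episode_pdrs, target):
    """Fraction of episodes whose PDR is at least `target`."""
    if not episode_pdrs:
        raise ValueError("need at least one episode")
    return sum(pdr >= target for pdr in episode_pdrs) / len(episode_pdrs)


def compliance_curve(episode_pdrs, targets):
    """Compliance fraction for each candidate target, as (target, fraction)."""
    return [(t, compliance_fraction(episode_pdrs, t)) for t in targets]


# Made-up per-episode PDR values for illustration only.
toy_pdrs = [0.80, 0.84, 0.86, 0.90]
curve = compliance_curve(toy_pdrs, targets=[0.82, 0.84, 0.86])
```

Sweeping the target in this way yields a non-increasing curve per policy, and two policies with close mean PDR can still differ widely in compliance when the target falls between their episode distributions.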

8.5. Limitations and Future Work

Several limitations bound the scope of this study and point to future work.

The evaluation considers a single cell with six DAPs. Inter-cell interference, handover, and cross-cell load balancing are therefore out of scope, and the absolute numbers should not be read as multi-cell performance. The regional formulation extends to multiple cells by treating each cell’s DAPs as additional regions, and a multi-cell study is the most important next step.

Traffic is generated by a Bernoulli arrival process modulated by deterministic congestion cycles. This follows the standard practice in the mMTC scheduling literature, which evaluates with synthetic parametric arrival models such as Poisson, Bernoulli, Beta, and Markov-modulated Poisson processes [12,13]; the 3GPP machine-type traffic model is itself a synthetic Beta-modulated process. We do not use recorded field traces. Publicly available smart-meter datasets are energy-consumption time series at minute-to-hour granularity and do not describe packet-level arrivals at the millisecond air interface, so they are not directly usable as scheduling input. Validation on operator packet-level traces, where available, would strengthen the external validity of the results and is left to future work.

The association decision in our system is governed primarily by buffer occupancy, deadline pressure, and congestion phase rather than by instantaneous channel realizations. At the modeled signal-to-noise ratios the system operates in a resource-limited rather than an SNR-limited regime: a scheduled transmission almost always meets the rate threshold, so which meters are scheduled, and on which path, is determined by queueing and capacity rather than by small channel fluctuations. We verified this behavior directly: varying the channel-state-information update interval over the studied epoch timescale leaves the delivery results effectively unchanged, because stale channel knowledge does not alter the buffer- and deadline-driven association decisions in this regime. We therefore model the channel as quasi-static over the epoch and report the results accordingly. A regime in which the channel is the binding constraint (with fast fading and explicitly delayed or erroneous CSI, where channel knowledge directly governs the association decision) is a distinct operating point and a worthwhile direction for future work.

The strategy library is fixed and hand-designed. The contribution of this paper is the regional selection mechanism rather than the rules it selects among. Learning the rules jointly with the selection policy, for example, through a learned low-level controller under the same regional high-level action, is a natural extension.

The comparison is against classical heuristic baselines and an optimal matching at the scheduling layer. As discussed in Section 2, the two natural learned alternatives (end-to-end per-meter association and a fully hierarchical RL controller with a learned low level) either confront the $2^{N}$ action space that our reformulation is designed to avoid, or replace the interpretable rule library with a second learned policy; to our knowledge, no prior learned method targets this two-path AMI association problem on an equal footing. A comparison against adapted learned baselines from related association problems would nonetheless sharpen the evaluation and is left to future work.

The high-load and bursty scenarios are evaluated at an operating point where the resource constraint is binding, which places the absolute PDR in the mid-0.80 range. This is intentional: the baseline scenario shows that at a lightly loaded operating point, all policies perform similarly, so a binding regime is required to separate them. The transferable findings are the relative ordering of policies and the service-level-compliance gap of Section 8.4, both of which hold across the three scenarios.

9. Conclusions

We addressed dynamic smart-meter association in 5G NR mMTC networks, where each meter must choose at run time between a direct path to the base station and an aggregated path through a DAP. The naive per-meter formulation has an action space of size $2^{N}$ that is intractable for reinforcement learning. We proposed a regional strategy composition in which the agent selects one interpretable association strategy per DAP region and a deterministic mapping expands the choice to per-meter modes, reducing the action space to $K^{D}$ while preserving meter-level granularity. On a 5G NR-calibrated simulator with 1500 meters and six DAPs, a PPO controller under this formulation improved packet delivery ratio over the strongest heuristic by +0.63, +2.41, and +2.66 percentage points in the baseline, high-load, and bursty-cycle regimes, respectively, all statistically significant over five seeds, with the advantage growing as the traffic became harder. The learned policy also reduced deadline misses in every regime. The central conclusion is that composing classical heuristics regionally, rather than replacing them, lets a learned controller exceed any single fixed heuristic precisely when no individual rule is globally optimal. Future work includes multi-cell extension, learning the strategy library jointly with the selection policy, and validation on measured AMI traffic.

Author Contributions

Conceptualization, M.A.-A. and E.Y.; methodology, M.A.-A. and E.Y.; software, M.A.-A.; validation, M.A.-A., J.I., E.I. and E.Y.; formal analysis, M.A.-A.; investigation, M.A.-A.; writing—original draft preparation, M.A.-A.; writing—review and editing, M.A.-A., J.I., E.I. and E.Y.; supervision, E.Y. and E.I. All authors have read and agreed to the published version of the manuscript.

Funding

This work was made possible by Qatar University–Grant no. QUST-CENG-2026-573. The research outcomes and statements made herein are solely the responsibility of the authors.

Data Availability Statement

The simulation code, environment, trained policies, and result files that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

Generative AI tools (Claude Opus 4.7/4.8) were used to assist with language refinement, drafting portions of the manuscript, coding support, and brainstorming/discussion of ideas.

Conflicts of Interest

The authors declare no conflict of interest.

References

Molokomme, D.N.; Chabalala, C.S.; Bokoro, P.N. Enhancement of Advanced Metering Infrastructure Performance Using Unsupervised K-Means Clustering Algorithm. Energies 2021, 14, 2732.
Gallardo, J.; Ahmed, M.; Jara, N. Clustering Algorithm-Based Network Planning for Advanced Metering Infrastructure in Smart Grid. IEEE Access 2021, 9, 48992–49006.
Haque, M.M.; Tariq, F.; Khandaker, M.; Wong, K.K.; Zhang, Y. A Survey of Scheduling in 5G URLLC and Outlook for Emerging 6G Systems. IEEE Access 2023, 11, 34372–34396.
Elgarhy, O.; Reggiani, L.; Alam, M.M.; Zoha, A.; Ahmad, R.; Kuusik, A. Energy Efficiency and Latency Optimization for IoT URLLC and mMTC Use Cases. IEEE Access 2024, 12, 23132–23148.
Lang, A.; Wang, Y.; Feng, C.; Stai, E.; Hug, G. Data Aggregation Point Placement for Smart Meters in the Smart Grid. IEEE Trans. Smart Grid 2022, 13, 541–554.
Sung, T.W.; Xu, Y.; Hu, X.; Lee, C.S.; Fang, Q. Optimizing data aggregation point location with grid-based model for smart grids. J. Intell. Fuzzy Syst. 2022, 42, 3189–3201.
Inga, E.; Dai, Y.; Inga, J.; Zhang, K. Connectivity-Oriented Optimization of Scalable Wireless Sensor Topologies for Urban Smart Water Metering. Smart Cities 2025, 8, 167.
Inga, E.; Inga, J.; Ortega, A. Novel Approach Sizing and Routing of Wireless Sensor Networks for Applications in Smart Cities. Sensors 2021, 21, 4692.
Abdullah, A.; Ashraf, E. New Dual Algorithm to Placement the Data Aggregation Point for Smart Grid Meters. Smart Grids Sustain. Energy 2024, 9, 21.
Khan, A.; Umar, A.; Munir, A.; Shirazi, S.; Khan, M.; Adnan, M. A QoS-Aware Machine Learning-Based Framework for AMI Applications in Smart Grids. Energies 2021, 14, 8171.
Khan, A.; Shirazi, S.; Adeel, M.; Assam, M.; Ghadi, Y.; Mohamed, H.; Xie, Y. A QoS-Aware Data Aggregation Strategy for Resource Constrained IoT-Enabled AMI Network in Smart Grid. IEEE Access 2023, 11, 98988–99004.
Kim, B. A priority-aware dynamic scheduling algorithm for ensuring data freshness in 5G networks. Future Gener. Comput. Syst. 2025, 163, 107542.
Weerasinghe, T.; Casares-Giner, V.; Balapuwaduge, I.; Li, F. Priority Enabled Grant-Free Access With Dynamic Slot Allocation for Heterogeneous mMTC Traffic in 5G NR Networks. IEEE Trans. Commun. 2021, 69, 3192–3206.
Kaura, Y.; Lall, B.; Mallik, R.; Singhal, A. Adaptive Scheduling of Shared Grant-Free Resources for Heterogeneous Massive Machine Type Communication in 5G and Beyond Networks. IEEE Trans. Netw. Serv. Manag. 2025, 22, 1188–1204.
Eldeeb, E.; Shehab, M.; Alves, H. A Learning-Based Fast Uplink Grant for Massive IoT via Support Vector Machines and Long Short-Term Memory. IEEE Internet Things J. 2021, 9, 3889–3898.
Liu, S.; Pan, C.; Zhang, C.; Yang, F.; Song, J. Dynamic Spectrum Sharing Based on Deep Reinforcement Learning in Mobile Communication Systems. Sensors 2023, 23, 2622.
Lu, Z.; Zhong, C.; Gursoy, M.C. Dynamic Channel Access and Power Control in Wireless Interference Networks via Multi-Agent Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2021, 71, 1588–1601.
Chaieb, C.; Abdelkefi, F.; Ajib, W. Deep Reinforcement Learning for Resource Allocation in Multi-Band and Hybrid OMA-NOMA Wireless Networks. IEEE Trans. Commun. 2023, 71, 187–198.
Alizadeh, A.; Lim, B.; Vu, M. Multi-Agent Q-Learning for Real-Time Load Balancing User Association and Handover in Mobile Networks. IEEE Trans. Wirel. Commun. 2024, 23, 9001–9015.
Wang, D.; Li, R.; Huang, C.; Xu, X.; Chen, H. User Association and Power Allocation for User-Centric Smart-Duplex Networks via Tree-Structured Deep Reinforcement Learning. IEEE Internet Things J. 2023, 10, 20216–20229.
Tao, Z.; Xu, W.; You, X. Large Vision Model-Enhanced Digital Twin With Deep Reinforcement Learning for User Association and Load Balancing in Dynamic Wireless Networks. IEEE J. Sel. Areas Commun. 2026, 44, 2718–2732.
Hsieh, C.K.; Chan, K.L.; Chien, F.T. Energy-Efficient Power Allocation and User Association in Heterogeneous Networks with Deep Reinforcement Learning. Appl. Sci. 2021, 11, 4135.
Giwa, O.; Awodunmila, T.; Mohsin, M.; Bilal, A.; Jamshed, M. Meta-Reinforcement Learning for Fast and Data-Efficient Spectrum Allocation in Dynamic Wireless Networks. IEEE Wirel. Commun. Lett. 2026, 15, 2000–2004.
Hutsebaut-Buysse, M.; Mets, K.; Latré, S. Hierarchical Reinforcement Learning: A Survey and Open Research Challenges. Mach. Learn. Knowl. Extr. 2022, 4, 172–221.
Eppe, M.; Gumbsch, C.; Kerzel, M.; Nguyen, P.; Butz, M.; Wermter, S. Intelligent problem-solving as integrated hierarchical reinforcement learning. Nat. Mach. Intell. 2022, 4, 11–20.
Wang, H.; Wang, J. Enhancing multi-UAV air combat decision making via hierarchical reinforcement learning. Sci. Rep. 2024, 14, 4458.
Yang, Y.; Shi, Y.; Cui, X.; Li, J.; Zhao, X. A Hybrid Decision-Making Framework for UAV-Assisted MEC Systems. Drones 2025, 9, 206.
Yi, W.; Qu, R.; Jiao, L.; Niu, B. Automated Design of Metaheuristics Using Reinforcement Learning Within a Novel General Search Framework. IEEE Trans. Evol. Comput. 2023, 27, 1072–1084.
Tian, Y.; Li, X.; Ma, H.; Zhang, X.; Tan, K.C.; Jin, Y. Deep Reinforcement Learning Based Adaptive Operator Selection for Evolutionary Multi-Objective Optimization. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 1051–1064.
Guo, H.; Ma, Y.; Zhang, Z.; Chen, J.; Zhang, X.; Cao, Z.; Zhang, J.; Gong, Y. Deep Reinforcement Learning for Dynamic Algorithm Selection: A Proof-of-Principle Study on Differential Evolution. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 4247–4259.
Li, C.; Wei, X.; Wang, J.; Wang, S.; Zhang, S. A review of reinforcement learning based hyper-heuristics. PeerJ Comput. Sci. 2024, 10, e2141.
Zhao, F.; Liu, Y.; Zhu, N.; Xu, T.; Xu, J. A selection hyper-heuristic algorithm with Q-learning mechanism. Appl. Soft Comput. 2023, 147, 110815.
Zhu, N.; Zhao, F.; Yu, Y.; Wang, L. A hierarchical reinforcement learning-aware hyper-heuristic algorithm with fitness landscape analysis. Swarm Evol. Comput. 2024, 90, 101669.
Durgut, R.; Aydin, M.E.; Rakib, A. Transfer Learning for Operator Selection: A Reinforcement Learning Approach. Algorithms 2022, 15, 24.
Alsuhli, G.; Banawan, K.; Attiah, K.; Elezabi, A.; Seddik, K.; Gaber, A.; Zaki, M.; Gadallah, Y. Mobility Load Management in Cellular Networks: A Deep Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2021, 22, 1581–1598.
Ramesh, P.; Bhuvaneswari, P.; Dhanushree, V.; Gokul, G.; Sahana, S. User association-based load balancing using reinforcement learning in 5G heterogeneous networks. J. Supercomput. 2025, 81, 328.
Ji, J.; Cai, L.; Zhu, K.; Niyato, D. Decoupled Association With Rate Splitting Multiple Access in UAV-Assisted Cellular Networks Using Multi-Agent Deep Reinforcement Learning. IEEE Trans. Mob. Comput. 2024, 23, 2186–2201.
Xiao, Y.; Song, Y.; Liu, J. Multi-Agent Deep Reinforcement Learning Based Resource Allocation for Ultra-Reliable Low-Latency Internet of Controllable Things. IEEE Trans. Wirel. Commun. 2023, 22, 5414–5430.
Noh, H.; Lee, H.; Yang, H.J. Joint Optimization on Uplink OFDMA and MU-MIMO for IEEE 802.11ax: Deep Hierarchical Reinforcement Learning Approach. IEEE Commun. Lett. 2024, 28, 1800–1804.
Geng, Y.; Liu, E.; Wang, R.; Liu, Y. Hierarchical Reinforcement Learning for Relay Selection and Power Optimization in Two-Hop Cooperative Relay Network. IEEE Trans. Commun. 2022, 70, 171–184. [Google Scholar] [CrossRef]
Zhang, H.; Wang, W.; Zhou, H.; Lu, Z.; Li, M. A Hierarchical DRL Approach for Resource Optimization in Multi-RIS Multi-Operator Networks. IEEE Trans. Wirel. Commun. 2025, 24, 4981–4995. [Google Scholar] [CrossRef]
Nucci, F.; Papadia, G. Bi-Objective Optimization for Scalable Resource Scheduling in Dense IoT Deployments via 5G Network Slicing Using NSGA-II. Telecom 2026, 7, 24. [Google Scholar] [CrossRef]
3rd Generation Partnership Project. Study on Channel Model for Frequencies from 0.5 to 100 GHz; Technical Report TR 38.901 V19.3.0, Release 19; European Telecommunications Standards Institute (ETSI): Sophia Antipolis, France, 2026. [Google Scholar]
3GPP. Study on RAN Improvements for Machine-Type Communications; Technical Report TR 37.868; European Telecommunications Standards Institute (ETSI): Sophia Antipolis, France, 2011. [Google Scholar]
Kumar, A.; Vidal, J.R.; Martinez-Bauset, J.; Li, F.Y. Semi-Contention-Free Access in IoT NOMA Networks: A Reinforcement Learning Framework. IEEE Trans. Commun. 2025, 73, 14413–14429. [Google Scholar] [CrossRef]

Figure 1. Two-layer resource-management model. At the fast scheduling layer (every 1 ms slot), a fixed token-weighted scheduler serves direct meter-to-BS transmissions using $C_{BS}$ RBs and DAP backhaul transmissions using $C_{DAP}$ RBs, with each backhaul RB aggregating up to A packets. At the slow association layer (every $T_{e}$ slots), a meter-mode assignment policy (a learned controller or a classical heuristic) assigns each meter to direct or DAP mode, and this configuration is passed down to the scheduling layer. The figure is illustrative; the evaluated system has 1500 meters and six DAPs.

Figure 2. Packet delivery ratio for all ten policies across the three traffic scenarios. Bars show the mean over five seeds and ten episodes; error bars show the standard deviation, and the dashed line marks the 90% reference. The two fixed extremes (DirectOnly, DAPOnly) are clearly inadequate, BufferAware is the strongest heuristic, and the learned PPO and DQN controllers sit above it in every scenario, with the largest margins under high load (S2) and bursty cycles (S3).

Figure 3. Deadline-miss rate (lower is better) for all ten policies across the three traffic scenarios. Bars show the mean over five seeds and ten episodes; error bars show the standard deviation. The learned PPO and DQN controllers achieve the lowest miss rate in every scenario, mirroring their delivery advantage in Figure 2.

Figure 4. Training convergence of PPO and DQN, showing mean episode reward against training steps (mean ± std across five seeds). Both controllers converge well before the training budget is exhausted, confirming that the reported performance reflects converged rather than under-trained policies.

Figure 5. Heat map of mean PDR over policies (rows) and scenarios (columns). The progression from cool to warm cells down each column shows the performance ordering; PPO and DQN occupy the warmest cells in every scenario.

Figure 6. Per-episode PDR distributions (across episodes and seeds) for each policy and scenario. The tight, clearly separated boxes for PPO and DQN relative to BufferAware indicate that the advantage is consistent across episodes and seeds rather than driven by outliers. The red dashed line marks the 90% packet-delivery-ratio reference (the service target).

Figure 7. Robustness sweeps on the four key parameters. (a) Strategy-library size K across the three traffic regimes (solid: PPO; dashed: BufferAware), showing diminishing returns and saturation by $K = 4$–5. (b) Number of DAPs D, (c) cell radius, and (d) aggregation factor A, all on the baseline scenario. The learned controller (PPO) remains at or above the strongest heuristic across every setting.

Figure 8. Action-space size of the regional reformulation versus the fully flexible per-meter space. (a) The number of regional joint actions $K^{D}$ on a logarithmic axis, growing from 729 at $K = 3$ to 15,625 at $K = 5$ (here $D = 6$). (b) The same regional count for $K = 5$, shown as $\log_{10}$, beside the per-meter space $2^{N} \approx 10^{452}$ at $N = 1500$; the two panels use different vertical scales, so the $K = 5$ bar that reads 15,625 in panel (a) appears as $10^{4.2}$ in panel (b). The regional space is smaller than the per-meter space by more than 447 orders of magnitude and is independent of N, depending only on K and D.
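
The counts quoted in the Figure 8 caption can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names are hypothetical and not taken from the authors' code. It works in $\log_{10}$ for the per-meter space to keep the number readable.

```python
import math


def regional_action_count(k: int, d: int) -> int:
    """Number of joint regional actions: one of k strategies for each of d DAP regions."""
    return k ** d


def per_meter_log10(n: int) -> float:
    """log10 of the per-meter action-space size 2**n."""
    return n * math.log10(2)


# Values taken from the caption and Table 1: K = 5 strategies, D = 6 DAPs, N = 1500 meters.
K, D, N = 5, 6, 1500

regional = regional_action_count(K, D)
regional_log10 = math.log10(regional)
per_meter = per_meter_log10(N)
gap_orders = per_meter - regional_log10

print(f"K^D at K=3: {regional_action_count(3, D)}")
print(f"K^D at K=5: {regional} (log10 = {regional_log10:.1f})")
print(f"log10(2^N) at N={N}: {per_meter:.1f}")
print(f"gap in orders of magnitude: {gap_orders:.1f}")
```

Because `regional_action_count` takes no argument for N, the regional count stays fixed however many meters are added, which is the independence the caption describes.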

Figure 9. Distribution of regional strategies selected by the trained PPO policy, by scenario (mean ± std across seeds). The policy spreads its choices across all five strategies rather than collapsing to one, confirming that it composes heuristics. The shift toward the deadline-aware strategy under the bursty-cycle scenario reflects the greater deadline pressure due to sharp congestion peaks.

Figure 10. Trade-off between association switching cost (switches per slot, horizontal axis) and PDR (vertical axis) for each policy and scenario. Policies toward the upper region achieve higher delivery; the learned controllers attain the highest PDR at a switching cost comparable to the strongest heuristic, while HybridMeter switches least at lower PDR. The red dashed line marks the 90% packet-delivery-ratio reference (the service target).

Figure 11. Scalability with meter count N. With radio resources provisioned to the offered load, the mean PDR (left axis, solid) is maintained as N grows from 1500 to 5000, while the regional action-space size $K^{D}$ (right axis, dashed) remains constant at 15,625. Training time grows only moderately and the policy size is independent of N.

Figure 12. Operating-point view for the high-load scenario: the fraction of episodes meeting a delivery target $θ$ (the empirical probability that $PDR \geq θ$) as $θ$ varies. The learned controllers retain high compliance over a range of targets where the strongest heuristic has already fallen off, showing that a small average margin yields a large difference in service-level compliance. The exact target a utility adopts is policy-dependent; the curves show the comparison is robust across the plausible range.
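
The compliance quantity plotted in Figure 12 is the share of episodes whose PDR reaches the target $θ$. A minimal sketch of that computation follows. The function name and the per-episode values are made up for illustration and are not the paper's data.

```python
from typing import Sequence


def compliance(pdrs: Sequence[float], theta: float) -> float:
    """Empirical probability that an episode's PDR is at least theta."""
    if not pdrs:
        raise ValueError("need at least one episode")
    return sum(1 for p in pdrs if p >= theta) / len(pdrs)


# Hypothetical per-episode PDRs for two policies (illustrative values only).
policy_a = [0.86, 0.85, 0.84, 0.855]
policy_b = [0.83, 0.81, 0.85, 0.82]

# Sweep the target to trace out one curve per policy.
for theta in (0.80, 0.84, 0.87):
    print(theta, compliance(policy_a, theta), compliance(policy_b, theta))
```

In this toy data the two means differ by only a few percentage points, yet at a target of 0.84 one policy meets it in every episode and the other in one episode out of four. That is the kind of effect the caption points to.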

Table 1. Simulator parameters.

Parameter	Value
Meters N	1500
DAPs D	6
Cell radius	1000 m
Direct RBs $C_{BS}$	20
Backhaul RBs $C_{DAP}$	8
Aggregation factor A	4
Base arrival rate $λ^{0}$	$U [0.012, 0.030]$
Cycle period $T_{c}$	60 slots
Cycle peak multiplier m	6.0
Cycle peak width w	15 slots
DAP buffer capacity $β_{max}$	60
Deadline $T_{dl}$	25 slots
CSI update period	10 slots
Epoch length $T_{e}$	5 slots
Episode length	240 slots
Reward weights $(w_{p}, w_{m}, w_{s})$	$(1.0, 1.5, 0.02)$

Table 2. Packet delivery ratio (mean ± std over 5 seeds × 10 episodes). Best heuristic per column underlined; best overall in bold.

Policy	S1 (Baseline)	S2 (High Load)	S3 (Bursty Cycle)
DirectOnly	$0.5735 \pm 0.0044$	$0.4970 \pm 0.0030$	$0.4846 \pm 0.0027$
DAPOnly	$0.6189 \pm 0.0035$	$0.5395 \pm 0.0025$	$0.5329 \pm 0.0020$
DistThreshold	$0.7714 \pm 0.0065$	$0.7167 \pm 0.0067$	$0.7234 \pm 0.0060$
CycleAware	$0.8872 \pm 0.0058$	$0.7843 \pm 0.0039$	$0.7665 \pm 0.0038$
OracleCycle	$0.8872 \pm 0.0058$	$0.7843 \pm 0.0039$	$0.7665 \pm 0.0038$
DeadlineAware	$0.9119 \pm 0.0058$	$0.8266 \pm 0.0061$	$0.8033 \pm 0.0055$
HybridMeter	$0.9048 \pm 0.0052$	$0.8026 \pm 0.0036$	$0.7860 \pm 0.0039$
BufferAware	$0.9372 \pm 0.0054$	$0.8293 \pm 0.0168$	$0.8155 \pm 0.0060$
DQN	$0.9407 \pm 0.0044$	$0.8444 \pm 0.0046$	$0.8323 \pm 0.0033$
PPO	$0.9434 \pm 0.0046$	$0.8535 \pm 0.0048$	$0.8421 \pm 0.0041$

Table 3. Significance of the learned advantage over BufferAware (PDR). CI is the 95% confidence interval on the difference; $p_{t}$ is the p-value of the paired t-test over per-seed means; d is Cohen’s d over episodes.

Scenario	Method	ΔPDR (pp)	95% CI (pp)	$p_{t}$	Cohen’s d	Sig.
S1 (baseline)	PPO	$+ 0.63$	$[+ 0.42, + 0.83]$	$3.4 \times 10^{- 4}$	$1.23$	***
S1 (baseline)	DQN	$+ 0.35$	$[+ 0.15, + 0.55]$	$8.0 \times 10^{- 3}$	$0.70$	**
S2 (high load)	PPO	$+ 2.41$	$[+ 1.91, + 2.91]$	$8.9 \times 10^{- 4}$	$1.94$	***
S2 (high load)	DQN	$+ 1.51$	$[+ 1.01, + 2.01]$	$5.1 \times 10^{- 3}$	$1.21$	**
S3 (bursty cycle)	PPO	$+ 2.66$	$[+ 2.45, + 2.86]$	$1.0 \times 10^{- 5}$	$5.12$	***
S3 (bursty cycle)	DQN	$+ 1.68$	$[+ 1.48, + 1.87]$	$4.7 \times 10^{- 5}$	$3.41$	***

** $p < 0.01$, *** $p < 0.001$.
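
As a rough cross-check, the ΔPDR column of Table 3 can be approximated from the rounded means in Table 2. The sketch below does this for PPO against BufferAware. It also computes an effect size under an assumed convention (pooled standard deviation with equal sample sizes), since the section does not state which Cohen's d formula was used. Because the inputs are rounded to four decimals, the outputs land close to the tabulated values but need not match them exactly.

```python
import math

# (mean, std) of PDR copied from Table 2, expressed as fractions.
TABLE2 = {
    "S1": {"BufferAware": (0.9372, 0.0054), "PPO": (0.9434, 0.0046)},
    "S2": {"BufferAware": (0.8293, 0.0168), "PPO": (0.8535, 0.0048)},
    "S3": {"BufferAware": (0.8155, 0.0060), "PPO": (0.8421, 0.0041)},
}


def delta_pp(mean_a: float, mean_b: float) -> float:
    """Difference of two PDR means in percentage points."""
    return (mean_a - mean_b) * 100.0


def cohens_d(mean_a: float, std_a: float, mean_b: float, std_b: float) -> float:
    """Cohen's d with a pooled std for equal group sizes (an assumed convention)."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled


for scenario, rows in TABLE2.items():
    (m_ppo, s_ppo), (m_ba, s_ba) = rows["PPO"], rows["BufferAware"]
    gain = delta_pp(m_ppo, m_ba)
    effect = cohens_d(m_ppo, s_ppo, m_ba, s_ba)
    print(f"{scenario}: delta = {gain:+.2f} pp, d = {effect:.2f}")
```

The confidence intervals and $p_{t}$ values cannot be recovered this way, because they depend on the per-seed means, which the tables do not list.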

Share and Cite

MDPI and ACS Style

Al-Ali, M.; Inga, E.; Inga, J.; Yaacoub, E. Regional Strategy Composition: A Hierarchical-Action Reinforcement Learning Framework for Dynamic Smart-Meter Association over 5G NR mMTC Networks. Future Internet 2026, 18, 337. https://doi.org/10.3390/fi18070337
