Regional Strategy Composition: A Hierarchical-Action Reinforcement Learning Framework for Dynamic Smart-Meter Association over 5G NR mMTC Networks
Abstract
1. Introduction
- We formalize dynamic smart-meter association as a Markov decision process with deadline-bounded delivery, stale channel-state information (CSI), and phase-offset congestion cycles at the data aggregation points (DAPs), and we show that the naive per-meter action space is intractable.
- We propose a regional strategy-composition formulation that reduces the action space from $2^N$ to $K^D$ (for $N$ meters, $D$ DAP regions, and a library of $K$ strategies) while preserving meter-level granularity through a deterministic strategy-to-mode mapping.
- We implement Proximal Policy Optimization (PPO) and Deep Q-Network (DQN) controllers under this formulation and compare them, on a single 5G NR-calibrated simulator, against eight meter-level baselines, including the strongest buffer-aware heuristic and a perfect-information cycle oracle.
- We report a statistically rigorous evaluation over five random seeds and three traffic regimes, with paired significance tests and effect sizes, and we show that the learned controller’s advantage over the best heuristic grows monotonically with traffic stress.
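To make the size gap between the two action spaces concrete, the following sketch simply counts joint actions. It rests on two assumptions that this section does not spell out: every meter picks one of two modes (direct or nearest DAP), and every DAP region picks one strategy from the library. The values N = 1500 and D = 6 come from the parameter table, and K = 5 from the five-strategy library; the function names are illustrative.

```python
def per_meter_actions(n_meters: int, modes_per_meter: int = 2) -> int:
    """Joint actions when every meter chooses its mode independently (assumed binary)."""
    return modes_per_meter ** n_meters


def regional_actions(n_regions: int, n_strategies: int) -> int:
    """Joint actions when every region chooses one strategy from the library."""
    return n_strategies ** n_regions


N, D, K = 1500, 6, 5  # meters, DAP regions, strategies
flat = per_meter_actions(N)
regional = regional_actions(D, K)

# The flat count is too large to print usefully, so report its order of magnitude.
print(f"per-meter joint actions: about 10^{len(str(flat)) - 1}")
print(f"regional joint actions:  {regional}")
```

Under these assumptions the regional space has 15,625 joint actions, which a discrete-action learner can enumerate directly, whereas the per-meter space has more than 450 decimal digits.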
2. Related Work
2.1. DAP Placement and Static Meter Assignment
2.2. Reinforcement Learning for Wireless Resource Management
2.3. Hierarchical RL and Heuristic Selection
2.4. Deadline-Aware mMTC Scheduling
3. System Model
3.1. Network Topology
3.2. Physical Layer
3.3. Traffic and Congestion Cycles
3.4. Two-Layer Resource Management
4. Problem Formulation
4.1. Optimization Problem
4.2. Markov Decision Process
4.3. The Per-Meter Action Space and Its Intractability
4.4. Regional Strategy Composition
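| Algorithm 1 Regional-to-meter expansion | |
| Require: regional action $a = (a_1, \dots, a_D)$, state s | |
| Ensure: per-meter modes $m_1, \dots, m_N$ | |
| 1: for each meter $i$ do | |
| 2: $d \leftarrow$ nearest DAP of meter $i$ | ▹ index of nearest DAP |
| 3: $k \leftarrow a_d$ | ▹ strategy chosen for that region |
| 4: $m_i \leftarrow \sigma_k(i, s)$ | ▹ apply rule with local state |
| 5: end for | |
| 6: return $(m_1, \dots, m_N)$ | |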
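A minimal Python sketch of Algorithm 1 follows. The mode encoding, the `Strategy` signature, the state dictionary, and the two toy strategies are assumptions made for illustration; only the three-step structure (nearest DAP, strategy for that region, rule applied with local state) is taken from the algorithm.

```python
from typing import Callable, Dict, List, Sequence

DIRECT, DAP = 0, 1  # assumed encoding of the two per-meter modes

# A strategy maps (meter index, nearest-DAP index, state) to a mode.
Strategy = Callable[[int, int, Dict], int]


def expand(action: Sequence[int], nearest_dap: Sequence[int],
           strategies: Sequence[Strategy], state: Dict) -> List[int]:
    """Expand one strategy index per region into one mode per meter."""
    modes = []
    for i, d in enumerate(nearest_dap):            # step 2: region of meter i
        k = action[d]                              # step 3: strategy chosen for region d
        modes.append(strategies[k](i, d, state))   # step 4: apply rule with local state
    return modes


# Two toy strategies (hypothetical rules and threshold).
def always_direct(i: int, d: int, state: Dict) -> int:
    return DIRECT


def dap_if_buffer_safe(i: int, d: int, state: Dict) -> int:
    return DAP if state["buffer_fill"][d] < 0.8 else DIRECT


state = {"buffer_fill": [0.2, 0.95]}   # occupancy of two DAP buffers
nearest = [0, 0, 1, 1]                 # four meters, two regions
modes = expand([1, 1], nearest, [always_direct, dap_if_buffer_safe], state)
print(modes)  # [1, 1, 0, 0]
```

Both regions select the same strategy here, yet meters in the second region fall back to the direct path because their DAP buffer is nearly full. This is how a regional action can still yield meter-level differences.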
4.5. Deployment Architecture and Information Model
5. Methodology
5.1. Strategy Library
- Direct-priority: assign the direct path. Useful when the region’s DAP is congested or unreliable.
- DAP-priority: assign the DAP path when the link is adequate and the buffer is not critical.
- Buffer-aware: use the DAP only if its buffer is safe.
- Cycle-aware: avoid the DAP during or shortly before its congestion peak, that is, while its cycle phase lies inside the peak-and-approach window.
- Deadline-aware: for urgent packets, choose the path with the higher immediate delivery probability given stale CSI and buffer state; otherwise, default to buffer-aware behavior.
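The five rules above can be sketched as small functions over a per-meter local view. Everything numeric here is a placeholder: the thresholds, the peak start, and the approach margin are hypothetical, and the `LocalView` fields are an assumed stand-in for the local state. Only the cycle period (60 slots) and peak width (15 slots) come from the parameter table.

```python
from dataclasses import dataclass

DIRECT, DAP = 0, 1  # assumed mode encoding


@dataclass
class LocalView:
    link_quality: float  # stale CSI estimate of the meter-to-DAP link, in [0, 1]
    buffer_fill: float   # DAP buffer occupancy, in [0, 1]
    phase: int           # current slot within the DAP congestion cycle
    slack: int           # slots left before the head-of-line packet's deadline
    p_direct: float      # estimated immediate delivery probability, direct path
    p_dap: float         # estimated immediate delivery probability, DAP path


# Hypothetical thresholds and window placement.
LINK_MIN, BUF_CRITICAL, BUF_SAFE, URGENT_SLACK = 0.5, 0.9, 0.7, 5
PERIOD, PEAK_WIDTH = 60, 15        # from the parameter table
PEAK_START, APPROACH = 30, 5       # assumed


def direct_priority(v: LocalView) -> int:
    return DIRECT


def dap_priority(v: LocalView) -> int:
    ok = v.link_quality >= LINK_MIN and v.buffer_fill < BUF_CRITICAL
    return DAP if ok else DIRECT


def buffer_aware(v: LocalView) -> int:
    return DAP if v.buffer_fill < BUF_SAFE else DIRECT


def cycle_aware(v: LocalView) -> int:
    # Offset from the start of the approach window, wrapped around the cycle.
    offset = (v.phase - (PEAK_START - APPROACH)) % PERIOD
    return DIRECT if offset < PEAK_WIDTH + APPROACH else DAP


def deadline_aware(v: LocalView) -> int:
    if v.slack <= URGENT_SLACK:  # urgent packet: pick the likelier path
        return DAP if v.p_dap > v.p_direct else DIRECT
    return buffer_aware(v)


STRATEGIES = [direct_priority, dap_priority, buffer_aware, cycle_aware, deadline_aware]

view = LocalView(link_quality=0.8, buffer_fill=0.4, phase=28, slack=12,
                 p_direct=0.6, p_dap=0.7)
print([rule(view) for rule in STRATEGIES])  # [0, 1, 1, 0, 1]
```

For the same local view the rules disagree: the cycle-aware rule avoids the DAP because slot 28 falls inside the assumed approach window, while the buffer-based rules still accept it. This disagreement is what gives the regional controller something to choose between.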
5.2. Learning Algorithms
| Algorithm 2 Regional association control loop (one episode) | |
| Require: trained or learning policy $\pi_\theta$, epoch length $T_e$, episode length T | |
| 1: initialize meter modes, DAP buffers, packet queues | |
| 2: for epoch $e = 1, \dots, T/T_e$ do | |
| 3: observe global state $s_e$ | |
| 4: sample regional action $a_e \sim \pi_\theta(\cdot \mid s_e)$ | |
| 5: $m \leftarrow \mathrm{Expand}(a_e, s_e)$ | ▹ Algorithm 1 |
| 6: apply modes $m$; count switches | |
| 7: for slot $t = 1, \dots, T_e$ do | |
| 8: generate arrivals with rate $\lambda(t)$ | |
| 9: serve direct group with the direct RBs (token-weighted) | |
| 10: enqueue DAP-mode packets to nearest-DAP buffers | |
| 11: flush DAP buffers via backhaul RBs | |
| 12: drop packets exceeding deadline | |
| 13: end for | |
| 14: compute reward $r_e$ by (13) | |
| 15: if training then | |
| 16: store transition $(s_e, a_e, r_e, s_{e+1})$; update $\pi_\theta$ | |
| 17: end if | |
| 18: end for | |
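The loop in Algorithm 2 can be sketched as a toy simulation. This is a simplified stand-in, not the calibrated simulator: arrivals are Bernoulli with a constant, hypothetical probability; packets are routed as they are generated and served first-in first-out rather than token-weighted; the backhaul budget is shared round-robin; the aggregation factor is ignored; and the reward and learning steps (14 to 17) are omitted because Equation (13) is not reproduced in this section. Default sizes follow the parameter table.

```python
import random
from collections import deque

DIRECT, DAP = 0, 1  # assumed mode encoding


def run_episode(policy, expand, nearest_dap, n_daps, T=240, epoch_len=5,
                direct_rbs=20, backhaul_rbs=8, buffer_cap=60, deadline=25,
                arrival_prob=0.02, seed=0):
    """Run one toy episode and return packet and switch counters."""
    rng = random.Random(seed)
    n = len(nearest_dap)
    modes = [DIRECT] * n                                # step 1
    direct_q = deque()                                  # packets stored as birth slots
    dap_q = [deque() for _ in range(n_daps)]
    stats = {"arrivals": 0, "delivered": 0, "dropped": 0, "switches": 0}
    for epoch in range(T // epoch_len):                 # step 2
        state = {"buffer_fill": [len(q) / buffer_cap for q in dap_q]}  # step 3
        action = policy(state)                          # step 4
        new_modes = expand(action, nearest_dap, state)  # step 5
        stats["switches"] += sum(a != b for a, b in zip(modes, new_modes))  # step 6
        modes = new_modes
        for k in range(epoch_len):                      # step 7
            t = epoch * epoch_len + k
            for i in range(n):                          # step 8: Bernoulli arrivals
                if rng.random() >= arrival_prob:
                    continue
                stats["arrivals"] += 1
                buf = dap_q[nearest_dap[i]]
                if modes[i] == DIRECT:
                    direct_q.append(t)
                elif len(buf) < buffer_cap:             # step 10
                    buf.append(t)
                else:
                    stats["dropped"] += 1               # DAP buffer overflow
            for _ in range(min(direct_rbs, len(direct_q))):   # step 9 (FIFO)
                direct_q.popleft()
                stats["delivered"] += 1
            budget = backhaul_rbs                       # step 11 (round-robin)
            while budget > 0 and any(dap_q):
                for q in dap_q:
                    if q and budget > 0:
                        q.popleft()
                        stats["delivered"] += 1
                        budget -= 1
            for q in [direct_q, *dap_q]:                # step 12
                while q and t - q[0] >= deadline:
                    q.popleft()
                    stats["dropped"] += 1
    stats["queued"] = len(direct_q) + sum(len(q) for q in dap_q)
    return stats


def as_modes(action, nearest_dap, state):
    """Degenerate expansion: the regional action is used directly as the mode."""
    return [action[d] for d in nearest_dap]


def all_dap(state):
    return [DAP] * 6


nearest = [i % 6 for i in range(100)]   # 100 meters spread over 6 regions
result = run_episode(all_dap, as_modes, nearest, n_daps=6)
print(result)
```

Any callable that maps the state to one strategy index per region can replace `all_dap`, and a richer `expand` can replace `as_modes`; a learning agent would add the reward computation and update at the end of each epoch.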
6. Experimental Setup
6.1. Calibration
6.2. Scenarios
- S1 (baseline). Nominal arrival rates and cycles. A control regime in which the system is comfortably provisioned.
- S2 (high load). Arrival rates are amplified network-wide, stressing both transmission paths so that no single rule dominates.
- S3 (bursty cycle). Congestion peaks are sharper and more concentrated, so the timing of DAP usage matters more and cycle prediction becomes more valuable.
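One way to parameterize the three regimes is a per-DAP arrival-rate multiplier driven by the congestion cycle. The sketch below assumes a rectangular peak and evenly staggered phase offsets across DAPs; neither shape is confirmed by this section. The S1 cycle values (period 60 slots, peak multiplier 6.0, peak width 15 slots) come from the parameter table, while the S2 load scale and the S3 peak shape are hypothetical placeholders. The base arrival rate is left as an argument because its value is not given here.

```python
def cycle_multiplier(t, offset=0, period=60, peak_mult=6.0, peak_width=15):
    """Rectangular cycle (assumed): peak_mult inside the peak window, 1.0 elsewhere."""
    phase = (t - offset) % period
    return peak_mult if phase < peak_width else 1.0


# S1 uses the tabulated cycle; the S2 and S3 numbers are illustrative only.
SCENARIOS = {
    "S1": {"load_scale": 1.0, "peak_mult": 6.0, "peak_width": 15},
    "S2": {"load_scale": 2.0, "peak_mult": 6.0, "peak_width": 15},
    "S3": {"load_scale": 1.0, "peak_mult": 12.0, "peak_width": 8},
}


def arrival_rate(t, dap, scenario, base_rate, n_daps=6, period=60):
    """Arrival rate for meters in the region of `dap` at slot `t`."""
    cfg = SCENARIOS[scenario]
    offset = dap * period // n_daps  # assumed evenly staggered phase offsets
    mult = cycle_multiplier(t, offset, period, cfg["peak_mult"], cfg["peak_width"])
    return base_rate * cfg["load_scale"] * mult


# At slot 0, DAP 0 is at its peak while DAP 1 (offset 10 slots) is not.
print(arrival_rate(0, 0, "S1", base_rate=0.01), arrival_rate(0, 1, "S1", base_rate=0.01))
```

Because the offsets differ by region, some DAP is near its peak at almost any slot, which is why a single network-wide rule is unlikely to suit every region at once.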
6.3. Baselines
6.4. Protocol and Statistics
7. Analysis of Results
7.1. Overall Comparison
7.2. Statistical Significance
7.3. Deadline-Miss Rate
7.4. The Advantage Grows with Stress
7.5. Robustness and Sensitivity Analysis
7.5.1. Strategy-Library Size K
7.5.2. Number of DAPs D
7.5.3. Cell Radius
7.5.4. Aggregation Factor A
7.5.5. Action-Space Expressiveness
7.6. What the Policy Learned
7.7. Cost of the Learned Policy
7.8. Scalability with Meter Count
8. Discussion
8.1. Why Regional Composition Works
8.2. Relation to the Literature
8.3. Practical Implications
8.4. Operating-Point Analysis
8.5. Limitations and Future Work
9. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Parameter | Value |
|---|---|
| Meters N | 1500 |
| DAPs D | 6 |
| Cell radius | 1000 m |
| Direct RBs | 20 |
| Backhaul RBs | 8 |
| Aggregation factor A | 4 |
| Base arrival rate | |
| Cycle period | 60 slots |
| Cycle peak multiplier m | 6.0 |
| Cycle peak width w | 15 slots |
| DAP buffer capacity | 60 |
| Deadline | 25 slots |
| CSI update period | 10 slots |
| Epoch length | 5 slots |
| Episode length | 240 slots |
| Reward weights | |

| Policy | S1 (Baseline) | S2 (High Load) | S3 (Bursty Cycle) |
|---|---|---|---|
| DirectOnly | |||
| DAPOnly | |||
| DistThreshold | |||
| CycleAware | |||
| OracleCycle | |||
| DeadlineAware | |||
| HybridMeter | |||
| BufferAware | |||
| DQN | |||
| PPO | |||

| Scenario | Method | ΔPDR (pp) | 95% CI (pp) | Cohen’s d | Sig. | |
|---|---|---|---|---|---|---|
| S1 (baseline) | PPO | | | | *** | |
| S1 (baseline) | DQN | | | | ** | |
| S2 (high load) | PPO | | | | *** | |
| S2 (high load) | DQN | | | | ** | |
| S3 (bursty cycle) | PPO | | | | *** | |
| S3 (bursty cycle) | DQN | | | | *** | |
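
The columns of this table can be derived from per-seed paired differences. The sketch below shows one common recipe, assuming a paired t interval over five seeds and Cohen's d computed on the differences; the source does not state which variants were used. The input numbers are made-up placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for df = 4 (five paired seeds).
T_CRIT_95 = {4: 2.776}


def paired_summary(learned, baseline):
    """Mean paired difference, 95% CI, t statistic, and Cohen's d on differences."""
    diffs = [a - b for a, b in zip(learned, baseline)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)              # sample standard deviation
    sem = sd / math.sqrt(n)
    half = T_CRIT_95[n - 1] * sem
    return {
        "delta": mean,
        "ci": (mean - half, mean + half),
        "t": mean / sem,
        "cohens_d": mean / sd,                # paired variant (assumed)
    }


# Placeholder PDR values in percent for five seeds (illustrative only).
learned = [91.0, 92.0, 93.0, 92.5, 91.5]
baseline = [90.0, 90.5, 91.0, 91.5, 90.0]
summary = paired_summary(learned, baseline)
print(summary)
```

With only five seeds the interval is wide relative to the mean difference, so reporting the effect size alongside the significance marker, as the table does, helps separate a consistent gain from a large one.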
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Al-Ali, M.; Inga, E.; Inga, J.; Yaacoub, E. Regional Strategy Composition: A Hierarchical-Action Reinforcement Learning Framework for Dynamic Smart-Meter Association over 5G NR mMTC Networks. Future Internet 2026, 18, 337. https://doi.org/10.3390/fi18070337