Safety-Constrained Reinforcement Learning for Energy-Aware Transmission Scheduling in Seismic Wireless Sensor Networks
Abstract
1. Introduction
- A guard-layer safety filter that operates as a runtime constraint enforcement mechanism, intercepting actions from the learned policy and substituting safe alternatives when battery or load-balance constraints are violated. The guard layer requires no retraining and can be configured independently of the RL agent.
- An action-masked PPO formulation for multi-node transmission scheduling that handles variable network topology through fixed-size observation padding and dynamic action masking, enabling a single trained model to operate across networks of different sizes.
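- An empirical evaluation across three network scales (10, 15, and 30 nodes) with realistic solar energy harvesting, showing that at 15 and 30 nodes the guard-enhanced PPO achieves the highest transmission success (97.58% and 99.46%) and higher node survival than the Round-Robin, Closest, and unconstrained PPO policies; only the Fixed baseline retains more nodes, at 22.71–31.88% transmission success.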
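
The masking and guard mechanisms named in the first two contributions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the padding size `MAX_NODES`, the thresholds `min_battery` and `max_load_share`, and the fallback rule (pick the admissible node with the most remaining energy) are all assumptions made for illustration.

```python
import numpy as np

MAX_NODES = 30  # assumed padding size (the largest network evaluated)

def pad_observation(battery, max_nodes=MAX_NODES):
    """Pad a per-node battery vector (fractions in [0, 1]) to a fixed length."""
    obs = np.zeros(max_nodes, dtype=np.float32)
    obs[: len(battery)] = battery
    return obs

def action_mask(battery, max_nodes=MAX_NODES):
    """True for real, alive nodes; False for padding slots and dead nodes."""
    mask = np.zeros(max_nodes, dtype=bool)
    mask[: len(battery)] = np.asarray(battery) > 0.0
    return mask

def masked_argmax(logits, mask):
    """Greedy action choice with invalid actions pushed to -inf."""
    return int(np.argmax(np.where(mask, logits, -np.inf)))

def guard_filter(action, battery, tx_counts, mask,
                 min_battery=0.2, max_load_share=0.5):
    """Return `action` if it meets both constraints, else a safe substitute.

    min_battery and max_load_share are hypothetical thresholds.
    """
    battery = np.asarray(battery, dtype=float)
    n = len(battery)
    total = max(int(np.sum(tx_counts)), 1)
    share = np.asarray(tx_counts, dtype=float) / total
    safe = mask[:n] & (battery >= min_battery) & (share <= max_load_share)
    if action < n and safe[action]:
        return action  # the policy's action passes unchanged
    if safe.any():
        # assumed fallback: the safe node with the most remaining energy
        return int(np.argmax(np.where(safe, battery, -np.inf)))
    return action  # no safer option exists; keep the policy's choice

# Four-node network padded to MAX_NODES; node 3 is dead, node 1 is overused.
battery = [0.9, 0.1, 0.6, 0.0]
tx_counts = [1, 8, 1, 0]
mask = action_mask(battery)
logits = np.zeros(MAX_NODES)
logits[1] = 5.0  # the policy prefers node 1
proposed = masked_argmax(logits, mask)
final = guard_filter(proposed, battery, tx_counts, mask)
print(proposed, final)
```

Because the guard only inspects the proposed action and the current node state, it can be tuned or disabled without touching the trained policy, which is the property the first contribution emphasises.
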
2. Related Work
2.1. Energy-Aware WSN Protocols
2.2. Reinforcement Learning for WSN Optimisation
2.3. Safe Reinforcement Learning
3. System Model
3.1. Network Architecture
3.2. Energy Model
3.3. Event Model
3.4. Energy Balance Analysis
3.5. Communication Model
4. Reinforcement Learning Formulation
4.1. Markov Decision Process
- a reward for correct transmission (event active, node detects it), a penalty for incorrect transmission (wrong node or dead node), and a penalty for a missed event;
- a penalty per transmission (energy conservation) and a distance penalty favouring proximal nodes;
- a bonus per alive node (survival incentive);
- a load-balancing term based on the standard deviation of battery levels;
- a penalty per dead cluster, and a further penalty if all nodes are dead.
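
The way these terms combine into a single per-step reward can be sketched as follows. Every magnitude in `RewardWeights` is a placeholder chosen for illustration, not a value from the paper, and the argument names (`norm_distance`, `dead_clusters`) are hypothetical; only the structure of the sum follows the list above.

```python
from dataclasses import dataclass
from statistics import pstdev

@dataclass(frozen=True)
class RewardWeights:
    # All magnitudes are placeholders, not the paper's values.
    correct_tx: float = 1.0
    wrong_tx: float = -1.0
    missed_event: float = -1.0
    energy_per_tx: float = -0.1
    distance: float = -0.1
    alive_bonus: float = 0.01
    balance: float = -0.1
    dead_cluster: float = -1.0
    all_dead: float = -10.0

def step_reward(w, *, event_active, transmitted, correct, norm_distance,
                battery_levels, dead_clusters):
    """Sum the reward terms for one step.

    battery_levels: per-node charge as fractions in [0, 1] (0 means dead).
    norm_distance: sender-to-event distance scaled to [0, 1] (assumed form).
    """
    r = 0.0
    if transmitted:
        r += w.correct_tx if (event_active and correct) else w.wrong_tx
        r += w.energy_per_tx + w.distance * norm_distance
    elif event_active:
        r += w.missed_event
    alive = [b for b in battery_levels if b > 0.0]
    r += w.alive_bonus * len(alive)                 # survival incentive
    r += w.balance * pstdev(battery_levels)         # load-balancing incentive
    r += w.dead_cluster * dead_clusters
    if not alive:
        r += w.all_dead
    return r

w = RewardWeights()
common = dict(event_active=True, norm_distance=0.2,
              battery_levels=[0.8, 0.8, 0.8], dead_clusters=0)
good = step_reward(w, transmitted=True, correct=True, **common)
bad = step_reward(w, transmitted=True, correct=False, **common)
missed = step_reward(w, transmitted=False, correct=False, **common)
print(good, bad, missed)
```
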
4.2. PPO with Generalised Advantage Estimation
5. Proposed Architecture
5.1. Network Architecture and Action Masking
5.2. System Pipeline
5.3. Guard-Layer Safety Filter
5.3.1. Safety Score Computation
5.3.2. Action Filtering with Relaxation
6. Experimental Setup
6.1. Simulation Environment
6.2. Training Configuration
6.3. Guard-Layer Hyperparameter Selection
6.4. Baseline Policies
Algorithm 1. Guard-Layer Action Filtering
6.5. Evaluation Metrics
7. Results
7.1. Aggregate Performance Across Scales
7.2. Cumulative Reward Comparison
7.3. Transmission Success and Node Survival
7.4. Load-Fairness Analysis
7.5. Percentage Improvements at 30 Nodes
7.6. Temporal Dynamics at 30 Nodes
7.7. Reward–Survival Trade-Off
7.8. Sensitivity to Event Probability
7.9. Statistical Robustness and Long-Tail Behaviour
8. Discussion
8.1. Scale-Dependent Performance Regimes
8.2. Reward Function Design Rationale
8.3. Guard-Layer Effectiveness
8.4. Temporal vs. Spatial Fairness
8.5. Action Masking and Sample Efficiency
8.6. Comparison with Constrained Policy Optimisation and Lagrangian-Relaxed PPO
8.7. Simulation Fidelity
8.8. Scalability Considerations
9. Limitations and Future Work
10. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| WSN | Wireless Sensor Network |
| RL | Reinforcement Learning |
| PPO | Proximal Policy Optimisation |
| GAE | Generalised Advantage Estimation |
| JFI | Jain’s Fairness Index |
| PVGIS | Photovoltaic Geographical Information System |
| MDP | Markov Decision Process |
| CPO | Constrained Policy Optimisation |











| Parameter | Symbol | Value |
|---|---|---|
| Battery capacity | | 15,000 J |
| Idle power | | 2.5 mW |
| Idle cost/step | | 9.0 J |
| Base TX cost | | 3.0 J |
| TX stress growth | | 0.0015 |
| Aging rate | | |
| Panel efficiency | | 12% |
| Panel area | A | 1.2 |
| State-report scale | — | 10% of TX cost |
| Peak harvest | | 49.8 J/h |
| Detection radius | | 250 km |
| Event probability | | 0.9 |
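
The energy parameters above are mutually consistent if one simulation step lasts one hour, which is an inference from the 8760-step episode length rather than something stated in the table. The short calculation below checks the idle cost under that assumption and derives two rough bounds; it ignores TX stress growth, ageing, and state-report costs.

```python
SECONDS_PER_STEP = 3600         # assumed: one step = one hour
BATTERY_CAPACITY_J = 15_000.0   # from the table
IDLE_POWER_W = 2.5e-3           # 2.5 mW
BASE_TX_COST_J = 3.0
PEAK_HARVEST_J_PER_H = 49.8

# Idle energy drawn in one step, in joules.
idle_cost_per_step = IDLE_POWER_W * SECONDS_PER_STEP

# Steps a full battery lasts with no harvesting and no transmissions.
idle_only_steps = BATTERY_CAPACITY_J / idle_cost_per_step

# Net energy gained in a peak-harvest hour that includes one base-cost TX.
peak_net_gain = PEAK_HARVEST_J_PER_H - idle_cost_per_step - BASE_TX_COST_J

print(f"idle cost per step: {idle_cost_per_step:.1f} J")
print(f"idle-only lifetime: {idle_only_steps:.0f} steps "
      f"(about {idle_only_steps / 24:.0f} days)")
print(f"net gain at peak harvest with one TX: {peak_net_gain:.1f} J")
```

The first figure reproduces the tabulated idle cost of 9.0 J per step; the other two are illustrative bounds only, since the real per-transmission cost grows with node stress.
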
| Parameter | Value |
|---|---|
| Learning rate | (linear decay) |
| Network architecture | MLP [256, 256] |
| Batch size | 256 |
| Steps per rollout | 8760 (one full episode) |
| Epochs per update | 10 |
| Entropy coefficient | 0.02 |
| Value function coefficient | 0.5 |
| Clipping parameter | 0.2 |
| Discount factor | 0.99 |
| GAE parameter | 0.95 |
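
The training configuration can be written down as a keyword dictionary, sketched below. The keyword names follow the Stable-Baselines3 `PPO` constructor, which is an assumption about the toolchain; `INITIAL_LR` is a placeholder because the initial learning rate is not given in the table, and the batch arithmetic assumes a single environment.

```python
def linear_schedule(initial_lr):
    """Learning rate that decays linearly from initial_lr to 0.

    progress_remaining runs from 1.0 (start) to 0.0 (end of training),
    the convention used for Stable-Baselines3 schedules.
    """
    def schedule(progress_remaining):
        return progress_remaining * initial_lr
    return schedule

INITIAL_LR = 3e-4  # placeholder only: the tabulated value is not available

ppo_kwargs = dict(
    learning_rate=linear_schedule(INITIAL_LR),  # linear decay
    policy_kwargs=dict(net_arch=[256, 256]),    # MLP [256, 256]
    batch_size=256,
    n_steps=8760,        # one full episode per rollout
    n_epochs=10,
    ent_coef=0.02,
    vf_coef=0.5,
    clip_range=0.2,
    gamma=0.99,
    gae_lambda=0.95,
)

# With one environment, a rollout does not split evenly into mini-batches.
full_batches, remainder = divmod(ppo_kwargs["n_steps"], ppo_kwargs["batch_size"])
print(full_batches, remainder)
```

Under the single-environment assumption, each epoch over a rollout yields 34 full mini-batches plus one short batch of 56 samples.
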
| Guard-layer hyperparameter | Survival (%) | Tx (%) | Reward | Load-Balance CV | Final Alive (Mean) |
|---|---|---|---|---|---|
| 0.0 | 65.47 | 99.34 | 8280.22 | 1.332 | 19.64 |
| 0.1 | 65.47 | 99.34 | 8280.22 | 1.332 | 19.64 |
| 0.2 | 65.47 | 99.34 | 8280.22 | 1.332 | 19.64 |
| 0.3 | 65.47 | 99.34 | 8280.22 | 1.332 | 19.64 |
| 0.5 | 65.47 | 99.34 | 8280.22 | 1.332 | 19.64 |
| 1.0 | 62.80 | 99.26 | 8258.21 | 1.687 | 18.84 |
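
As a quick consistency check on this sweep, the survival percentage should equal the mean number of surviving nodes divided by the network size. The sketch below assumes the sweep was run on the 30-node network, which the table does not state explicitly but which the numbers support.

```python
N_NODES = 30  # assumed network size for this sweep

# First-column setting: (survival %, mean final alive nodes), from the table.
rows = {
    0.0: (65.47, 19.64),
    1.0: (62.80, 18.84),
}

implied = {}
for setting, (survival_pct, final_alive) in rows.items():
    implied[setting] = 100.0 * final_alive / N_NODES
    print(f"{setting}: reported {survival_pct:.2f}%, "
          f"implied {implied[setting]:.2f}%")
```
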
| Scale | Policy | Reward | Tx Succ. (%) | Surv. (%) | Avg Batt. (%) | Load JFI ‡ |
|---|---|---|---|---|---|---|
| 10 nodes | Fixed † | | 23.54 | 100.00 | 93.78 | 1.000 |
| 10 nodes | Round-Robin | | 100.00 | 100.00 | 93.77 | 1.000 |
| 10 nodes | Closest | | 100.00 | 100.00 | 93.77 | 1.000 |
| 10 nodes | RL (PPO) | | 100.00 | 100.00 | 93.76 | 1.000 |
| 10 nodes | RL+Guard | | 100.00 | 100.00 | 93.77 | 1.000 |
| 15 nodes | Fixed † | | 31.88 | 66.53 | 75.25 | 0.596 |
| 15 nodes | Round-Robin | | 95.70 | 36.53 | 89.56 | 0.399 |
| 15 nodes | Closest | | 95.08 | 34.80 | 91.20 | 0.394 |
| 15 nodes | RL (PPO) | | 96.74 | 54.13 | 78.90 | 0.186 |
| 15 nodes | RL+Guard | | 97.58 | 62.93 | 73.69 | 0.398 |
| 30 nodes | Fixed † | | 22.71 | 69.20 | 75.62 | 0.669 |
| 30 nodes | Round-Robin | | 98.74 | 52.87 | 75.51 | 0.425 |
| 30 nodes | Closest | | 97.33 | 42.00 | 84.49 | 0.381 |
| 30 nodes | RL (PPO) | | 98.67 | 57.60 | 78.40 | 0.191 |
| 30 nodes | RL+Guard | | 99.46 | 66.47 | 74.31 | 0.362 |
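
For readers unfamiliar with the fairness column, Jain's fairness index of a load vector x with n entries is (Σx)² / (n · Σx²). The sketch below implements that standard definition; applying it to per-node transmission counts is an assumption about how the Load JFI column was computed.

```python
def jain_index(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(loads)
    total = sum(loads)
    squares = sum(x * x for x in loads)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero load")
    return total * total / (n * squares)

# Perfectly even load gives 1; one node carrying everything gives 1/n.
even = jain_index([5, 5, 5, 5])
skewed = jain_index([12, 0, 0, 0])
print(even, skewed)
```

An index near 1.000, as at 10 nodes, means transmissions are spread almost evenly, while values around 0.2 to 0.4 indicate that a minority of nodes carries most of the load.
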
| Baseline | Reward | Tx | Surv. | JFI |
|---|---|---|---|---|
| Fixed † | +306.8% | +338.5% | −3.9% | — |
| Round-Rob. | +6.7% | +0.7% | +25.7% | −14.8% |
| Closest | −6.2% | +2.2% | +58.3% | −5.0% |
| RL (PPO) | +11.4% | +0.8% | +15.4% | +89.5% |
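
The percentages in this table are relative changes of RL+Guard over each baseline. The sketch below recomputes several of them from the 30-node rows of the aggregate results table, assuming the usual definition 100 · (new − old) / old.

```python
def pct_change(new, old):
    """Relative change of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# 30-node values from the aggregate results table.
guard = {"tx": 99.46, "surv": 66.47, "jfi": 0.362}
baselines = {
    "Round-Robin": {"tx": 98.74, "surv": 52.87, "jfi": 0.425},
    "Closest": {"tx": 97.33, "surv": 42.00, "jfi": 0.381},
    "RL (PPO)": {"tx": 98.67, "surv": 57.60, "jfi": 0.191},
}

improvement = {
    name: {k: round(pct_change(guard[k], base[k]), 1) for k in guard}
    for name, base in baselines.items()
}
for name, row in improvement.items():
    print(name, row)
```
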
| Event probability | Fixed | RR | Closest | RL | RL+Guard | Gap vs. RR | Gap vs. Cls |
|---|---|---|---|---|---|---|---|
| Survival rate (%) | | | | | | | |
| 0.1 | 68.89 | 67.33 | 66.11 | 67.33 | 68.78 | | |
| 0.2 | 68.89 | 65.56 | 62.44 | 65.11 | 68.56 | | |
| 0.3 | 68.89 | 64.78 | 59.11 | 62.11 | 68.22 | | |
| 0.5 | 68.89 | 59.67 | 50.33 | 59.67 | 67.44 | | |
| 0.7 | 68.89 | 56.00 | 45.33 | 58.33 | 67.33 | | |
| 0.9 | 68.89 | 52.44 | 41.78 | 57.22 | 66.22 | | |
| Mean total reward | | | | | | | |
| 0.1 | −1162 | 58 | 102 | 43 | 218 | — | — |
| 0.5 | −2636 | 3952 | 4394 | 3529 | 4360 | — | — |
| 0.9 | −4080 | 7895 | 9034 | 7472 | 8457 | — | — |
| Transmission success (%) | | | | | | | |
| 0.1 | 22.84 | 99.58 | 99.40 | 99.43 | 99.64 | — | — |
| 0.5 | 22.57 | 99.25 | 98.33 | 98.87 | 99.55 | — | — |
| 0.9 | 22.76 | 98.84 | 97.20 | 98.63 | 99.44 | — | — |
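
One way to read the survival block of this table is to ask how much each policy degrades as events become more frequent. The sketch below computes the drop in survival, in percentage points, between the lowest and highest settings of the first column, which is assumed here to be the event probability.

```python
# Policy: (survival % at 0.1, survival % at 0.9), from the survival block.
survival = {
    "Fixed": (68.89, 68.89),
    "RR": (67.33, 52.44),
    "Closest": (66.11, 41.78),
    "RL": (67.33, 57.22),
    "RL+Guard": (68.78, 66.22),
}

# Drop in percentage points as the event probability rises from 0.1 to 0.9.
drop_pp = {
    name: round(at_low_p - at_high_p, 2)
    for name, (at_low_p, at_high_p) in survival.items()
}
for name, drop in sorted(drop_pp.items(), key=lambda item: item[1]):
    print(f"{name}: {drop:.2f} pp")
```

Among the policies that actually transmit reliably, RL+Guard loses the least survival (2.56 points) across the sweep, compared with 10.11 for unconstrained RL, 14.89 for Round-Robin, and 24.33 for Closest.
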
| Metric | Block 42 | Block 142 | Block 242 | Block 342 | Mean | Stdev |
|---|---|---|---|---|---|---|
| Survival rate (%) | 66.47 | 65.27 | 66.33 | 62.47 | 65.13 | 1.86 |
| Mean total reward | 8393 | 8338 | 8226 | 7897 | 8214 | 222 |
| Tx success (%) | 99.46 | 99.38 | 99.35 | 99.26 | 99.36 | 0.08 |
| Final alive nodes | 19.94 | 19.58 | 19.90 | 18.74 | 19.54 | 0.56 |
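
The Mean and Stdev columns can be reproduced from the four block values with Python's `statistics` module. The sketch assumes the Stdev column is the sample standard deviation (n − 1 denominator), which is the convention that matches the tabulated figures.

```python
from statistics import mean, stdev

# Per-block values from the table (Blocks 42, 142, 242, 342).
blocks = {
    "Survival rate (%)": [66.47, 65.27, 66.33, 62.47],
    "Mean total reward": [8393, 8338, 8226, 7897],
    "Tx success (%)": [99.46, 99.38, 99.35, 99.26],
    "Final alive nodes": [19.94, 19.58, 19.90, 18.74],
}

summary = {name: (mean(vals), stdev(vals)) for name, vals in blocks.items()}
for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.2f}, stdev={s:.2f}")
```
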
| Method | Survival (%) | Tx Success (%) | Reward | Load-Balance CV |
|---|---|---|---|---|
| RL (unconstrained PPO) | 57.60 | 98.67 | 7532 | 2.06 |
| Lagrangian-relaxed PPO | 56.60 | 98.15 | 7836 | 1.94 |
| RL+Guard (proposed) | 66.47 | 99.46 | 8393 | 1.33 |
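
The load-balance CV reported here and the Load JFI reported in the aggregate results are two views of the same dispersion: for any non-negative load vector, JFI = 1 / (1 + CV²) when CV uses the population standard deviation. The sketch below verifies the identity on a toy vector and then checks it against the tabulated 30-node values, assuming both columns were computed from the same per-node loads.

```python
from statistics import mean, pstdev

def jain_index(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    return sum(loads) ** 2 / (len(loads) * sum(x * x for x in loads))

def coeff_variation(loads):
    """Population coefficient of variation: std / mean."""
    return pstdev(loads) / mean(loads)

def jfi_from_cv(cv):
    """Identity linking the two metrics."""
    return 1.0 / (1.0 + cv * cv)

# Toy check of the identity on an arbitrary load vector.
toy = [4, 1, 1, 0]
toy_direct = jain_index(toy)
toy_via_cv = jfi_from_cv(coeff_variation(toy))

# Method: (load-balance CV from this table, Load JFI from the 30-node results).
reported = {
    "RL (unconstrained PPO)": (2.06, 0.191),
    "RL+Guard (proposed)": (1.33, 0.362),
}
for method, (cv, jfi) in reported.items():
    print(f"{method}: JFI from CV = {jfi_from_cv(cv):.3f}, reported JFI = {jfi}")
```

The values agree to within rounding, so a lower CV and a higher JFI are expressing the same improvement in load balance for RL+Guard.
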
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Nazamdin, I.; Reid, A. Safety-Constrained Reinforcement Learning for Energy-Aware Transmission Scheduling in Seismic Wireless Sensor Networks. Sensors 2026, 26, 3542. https://doi.org/10.3390/s26113542

