Next Article in Journal
MnO2 Nanostructure-Based Novel Sensing: A Review
Previous Article in Journal
Predefined-Time Sliding Mode Control of Robotic Manipulators via Artificial Delay Feedback and Reinforcement Learning
Previous Article in Special Issue
Robust Target Association Method with Weighted Bipartite Graph Optimal Matching in Multi-Sensor Fusion
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Safety-Constrained Reinforcement Learning for Energy-Aware Transmission Scheduling in Seismic Wireless Sensor Networks

School of Engineering, Cardiff University, Queen’s Buildings South Building, Room S/2.46, 5 The Parade, Newport Road, Cardiff CF24 3AA, UK
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(11), 3542; https://doi.org/10.3390/s26113542
Submission received: 8 April 2026 / Revised: 8 May 2026 / Accepted: 29 May 2026 / Published: 3 June 2026

Abstract

Wireless sensor networks (WSNs) deployed for seismic monitoring must sustain long-term operation under strict energy constraints, where premature node failure degrades spatial coverage and detection reliability. This paper presents a safety-constrained reinforcement learning framework for transmission scheduling in energy-harvesting seismic WSNs. The proposed approach integrates Proximal Policy Optimisation (PPO) with action masking and a runtime guard-layer safety filter that enforces battery-preservation and load-balancing constraints without retraining. The guard layer intercepts policy actions and substitutes safe alternatives when constraint violations are detected, using a scoring function that combines battery headroom with network-wide load equity. Experiments across three network scales (10, 15, and 30 nodes) with solar energy harvesting demonstrate that the guard-enhanced PPO achieves 99.46% transmission success at 30 nodes while maintaining 66.47% node survival—a 58.3% improvement in survival over the highest-reward baseline (Closest) at the cost of only a 6.2% reduction in cumulative reward. Crucially, the guard-enhanced policy outperforms the unconstrained PPO baseline simultaneously on cumulative reward ( + 11.4 % ), transmission success ( + 0.8 pp), and node survival ( + 15.4 % ), demonstrating that hard safety constraints, when properly aligned with the system’s energy model, provide both performance and safety gains rather than a fundamental trade-off. Sensitivity analysis across event rates ( p event = 0.5 and 0.9 ) confirms that the guard layer’s advantage persists under both moderate and extreme monitoring conditions. Analysis across scales reveals distinct operational regimes: at 10 nodes, heuristic baselines are near-optimal; at 30 nodes, learned policies dominate, and safety filtering becomes critical for sustained operation.

1. Introduction

Wireless sensor networks form the backbone of distributed environmental monitoring, enabling continuous observation of seismic, meteorological, and structural phenomena across geographically dispersed regions [1,2]. In seismic monitoring applications, sensor nodes must detect ground motion events and transmit data to a central controller while operating under severe energy constraints imposed by battery capacity and the intermittency of ambient energy harvesting [3,4]. The fundamental challenge is to maximise transmission reliability ensuring that detected seismic events are reported promptly—while preserving network longevity by preventing premature node depletion.
Traditional approaches to transmission scheduling in WSNs rely on hand-crafted heuristics. Fixed scheduling directs all traffic through a single designated node, leading to rapid depletion and a single point of failure. Round-robin policies distribute load uniformly but ignore spatial proximity and energy heterogeneity [5]. Proximity-based policies such as closest-node selection minimise per-transmission energy cost but concentrate load on geographically favourable nodes, ultimately reducing their lifetime [6,7]. None of these heuristics adapt to the dynamic interplay between stochastic event arrivals, spatially varying solar irradiance, and progressive battery degradation. The deployment scenario motivating this work is the seismically active Indian subcontinent, where terrain inaccessibility, lack of grid infrastructure, and extreme seasonal variation in solar irradiance make energy-autonomous sensor operation essential. India’s diverse seismic zones—spanning the northern collision boundary, the Andaman subduction arc, and the stable craton of peninsular India—present monitoring challenges where sparse node deployments must sustain long-term autonomous operation with no possibility of manual battery replacement.
Reinforcement learning (RL) offers a data-driven alternative in which a policy is learned directly from interaction with the environment, adapting to non-stationary dynamics without explicit modelling [8]. Recent work has applied RL to WSN routing and resource allocation [9], but deployment in safety-critical monitoring networks raises a concern: unconstrained RL policies may learn to concentrate transmissions on high-reward nodes, inadvertently depleting them and causing network fragmentation. This motivates the integration of hard safety constraints that guarantee battery preservation and load equity regardless of the learned policy’s behaviour.
This paper makes the following contributions:
  • A guard-layer safety filter that operates as a runtime constraint enforcement mechanism, intercepting actions from the learned policy and substituting safe alternatives when battery or load-balance constraints are violated. The guard layer requires no retraining and can be configured independently of the RL agent.
  • An action-masked PPO formulation for multi-node transmission scheduling that handles variable network topology through fixed-size observation padding and dynamic action masking, enabling a single trained model to operate across networks of different sizes.
  • A comprehensive empirical evaluation across three network scales (10, 15, and 30 nodes) with realistic solar energy harvesting, demonstrating that the guard-enhanced PPO achieves superior transmission success and survival compared to four heuristic baselines.

2. Related Work

2.1. Energy-Aware WSN Protocols

Energy management in WSNs has been studied extensively. LEACH [6] introduced cluster-based hierarchical routing with randomised cluster-head rotation to distribute energy load. PEGASIS [7] extended this to chain-based topologies, reducing the number of direct-to-sink transmissions. Energy-efficient MAC protocols such as S-MAC [10] employ duty cycling to reduce idle listening power. While effective in static deployments, these protocols rely on predetermined schedules and do not adapt to dynamic event distributions or heterogeneous energy harvesting conditions.

2.2. Reinforcement Learning for WSN Optimisation

The application of RL to WSN management has gained traction as network complexity has increased. Li [8] surveyed deep RL methods applicable to sequential decision-making in networked systems. PPO [11], combined with Generalised Advantage Estimation (GAE) [12], has emerged as a stable policy gradient method suitable for high-dimensional continuous and discrete action spaces. Action masking [13] has been shown to improve sample efficiency in constrained environments by preventing the agent from selecting infeasible actions during training.
Recent advances have applied deep RL to energy management and scheduling in EH-WSNs. Guo et al. [14] applied Q-learning to routing path optimisation for network lifetime extension using a DQN-based dual-mode scheme in rechargeable WSNs, demonstrating measurable survival gains over heuristic routing. Barat et al. [15] addressed cooperative energy management in harvesting WSNs using an actor-critic RL formulation but did not consider transmission scheduling safety under progressive node depletion. A parallel line of work [16] applied deep RL to peer-to-peer communication energy optimisation in WSNs, further validating the applicability of DRL frameworks to sensor network resource allocation. Alagha et al. [17] demonstrated data-driven active node selection for event localisation in IoT radiation monitoring, achieving improved detection with reduced energy consumption but without explicit safety constraints. Wang et al. [18] applied multi-agent deep RL to task offloading in group M2M communications, demonstrating scalability of RL-based resource allocation to multi-device scenarios. Luong et al. [19] surveyed deep RL applications in communications and networking, identifying energy-harvesting IoT scheduling as a key open problem. Liu et al. [20] addressed long-term network resource management using multi-agent RL with reward shaping, demonstrating improved network lifetime but without runtime safety enforcement. The present work differs from these contributions in three key respects. First, it targets sparse seismic monitoring with detection-radius-constrained transmission scheduling, a problem structure not addressed in prior DRL-WSN work. Second, it introduces a runtime guard-layer safety filter that enforces hard battery-preservation and load-balance constraints without modifying the trained policy or requiring retraining—a mechanism not present in any of the cited DRL-WSN approaches. Third, it provides empirical evidence across three network scales and a full-year simulation horizon, establishing the scale-dependent regimes under which RL-based scheduling provides a measurable advantage over heuristic baselines.

2.3. Safe Reinforcement Learning

Safe RL encompasses methods that enforce constraints during or after learning [21]. Constrained Markov Decision Processes (CMDPs) [22] formalise safety as auxiliary cost constraints on the expected cumulative cost. Constrained Policy Optimisation (CPO) [23] extends trust-region methods to enforce constraints at each policy update. Shielding [24] provides an alternative: a runtime monitor intercepts unsafe actions and replaces them with safe alternatives, decoupling safety enforcement from the learning objective. More recently, Gu et al. [25] provided a comprehensive review of safe RL methods, categorising approaches into constrained optimisation, shielding, and Lyapunov-based methods. Xiao et al. [26] proposed BarrierNet, integrating control barrier functions directly into neural network layers to provide differentiable safety guarantees. The guard layer proposed in this work draws conceptually from shielding, applying constraint filtering as a post hoc safety mechanism that does not modify the learned policy. Unlike BarrierNet’s differentiable approach, the guard layer operates as a non-differentiable runtime filter with interpretable, domain-expert-configurable parameters—a design choice motivated by the need for auditability in safety-critical infrastructure deployment.

3. System Model

3.1. Network Architecture

Consider a WSN of N sensor nodes deployed over a 1000 km × 1000 km monitoring region. Each node i { 1 , , N } is located at coordinates ( x i , y i ) and is equipped with a seismic sensor, a radio transceiver, a rechargeable lithium-ion battery with capacity E max , and a solar energy harvesting panel. A centralised controller receives transmissions from nodes and issues scheduling commands at hourly intervals. Each node sends a status update (battery level, alive status, event detection) to the controller every time step, incurring a state-report energy cost.
With a detection radius of R detect = 250  km, each node provides geometric coverage of approximately π R 2 196,350 km2. For a 1000 × 1000 km2 monitoring region, full spatial coverage is theoretically achievable with as few as five to six nodes; the evaluated scales of N { 10 , 15 , 30 } therefore represent progressively redundant deployments that prioritise energy resilience over minimal coverage. This sparse node density is consistent with real large-area seismic monitoring infrastructure deployed across the Indian subcontinent: the National Seismological Network operated by the India Meteorological Department (IMD) covers the seismically active zones of peninsular India, the northern arc, and the Andaman–Nicobar arc using a comparable node count per unit area [27]. The 30-node configuration evaluated here represents a realistic upper bound for a single-cluster deployment in such a region, capturing the scheduling complexity of energy-constrained management without requiring simulation at the scale of a national network.

3.2. Energy Model

The battery dynamics of node i at time step t are governed by
E i ( t + 1 ) = min E max , E i ( t ) C idle C tx , i ( t ) C sr + H i ( t )
where C idle = P idle · Δ t is the idle energy consumption per step, C tx , i ( t ) is the transmission cost (nonzero only when node i is selected for transmission), C sr is the state-report cost per step, and  H i ( t ) is the energy harvested from solar irradiance.
The idle power consumption is P idle = 2.5  mW, representative of a duty-cycled node architecture combining a low-power MEMS accelerometer (e.g., ADXL355, approximately 0.7 mW in normal operation), a microcontroller unit operating in deep-sleep mode (approximately 0.03 mW), and a LoRa transceiver in standby mode (approximately 0.005 mW), with periodic wake cycles for sensing and state reporting. This power budget reflects energy-harvesting sensor nodes designed for remote deployment in harsh environments, consistent with low-power seismic monitoring hardware documented in the literature [3]. The resulting idle cost is C idle = 9  J per hourly step. The base transmission cost is C tx , base = 3.0  J, subject to cumulative stress degradation:
C tx , i ( t ) = C tx , base · 1 + γ stress · n tx , i ( t )
where γ stress = 0.0015 and n tx , i ( t ) is the cumulative transmission count of node i up to time t. Battery aging is modelled as capacity fade at rate γ age = 5 × 10 5 per discharge cycle.
Solar energy harvesting follows
H i ( t ) = η · A · G i ( t ) · Δ t
where η = 0.12 is the panel efficiency, A = 1.2   cm 2 is the panel area, and  G i ( t ) is the solar irradiance at the node’s location and time, obtained from PVGIS data [28]. The panel dimensions ( η = 0.12 , A = 1.2   cm 2 ) reflect a compact, form-factor-constrained node designed for covert or minimally invasive deployment on terrain features in remote Indian seismic zones where larger structures are impractical. Energy drawn during transmission is supplied from the lithium-ion buffer battery rather than directly from the panel, consistent with energy-neutral sensor node architectures in which harvested energy is stored and then consumed in bursts [29]. Under peak irradiance ( G = 800   W / m 2 ), harvesting yields approximately 49.8 J/h; the daily average is approximately half this value due to diurnal variation.
A node is declared dead when E i ( t ) 0 , after which it can neither sense nor transmit. The state-report cost is modelled as C sr = 0.1 · C tx , base , representing the overhead of hourly status reporting to the controller.
On-node detection energy. The energy budget above includes idle, transmission, and state-report costs but treats local seismic detection (the algorithm that decides whether an event is currently observable) as folded into the idle term. Concretely, we assume each node runs a classical short-term-average/long-term-average (STA/LTA) trigger [30] continuously on the MEMS-accelerometer stream. Published measurements for STA/LTA on low-power MCUs (Cortex-M4 class) report an additional energy draw of 0.3 0.6  mW averaged over a duty-cycled implementation [31], contributing approximately 1.1 2.2  J per hourly step—i.e., 12–24% of the C idle = 9.0  J/step total budget already accounted for in this model. Heavier ML-based detectors (e.g., compact CNN or template-matching variants) would add a further 5– 15 % [32], which would raise P idle by a comparable fraction without changing the qualitative scheduling problem. The RL+Guard scheduler operates on the remaining transmission and state-report budget, so the energy savings reported in Section 7 apply on top of—not in lieu of—continuous on-node sensing and detection. Table 1 summarises all energy parameters.

3.3. Event Model

Seismic events occur stochastically at each time step with probability p event = 0.9 , representing a high-activity monitoring scenario. This high event rate is a deliberate stress-test condition designed to maximise scheduling pressure and differentiate policy performance; lower event rates reduce the frequency of scheduling decisions and tend to favour all policies equally, obscuring the comparative advantage of safety-constrained scheduling. Each event has a location ( x e , y e ) drawn uniformly from the monitoring region. A node i can detect the event if and only if
( x i , y i ) ( x e , y e )   R detect
where R detect = 250  km is the detection radius. This constraint ensures that only a subset of nodes can report any given event, making the scheduling decision non-trivial.

3.4. Energy Balance Analysis

The feasibility of perpetual network operation depends on the energy balance at each node. Under nominal conditions (no transmission), a node’s net hourly energy is
Δ E net = H ¯ C idle C sr = 24.9 9.0 0.3 = 15.6 J / h
where H ¯ 24.9  J/h is the daily average solar harvest. Thus, non-transmitting nodes accumulate energy at approximately 15.6 J per hour, ensuring long-term survival. However, a node selected for transmission incurs an additional cost of C tx 3.0  J. If a node transmits every hour, its net balance becomes 15.6 3.0 = 12.6  J/h—still positive under average conditions but marginal during low-irradiance periods (winter, cloudy days) when H i ( t ) 0 for multiple consecutive hours. The cumulative stress growth (2) further increases the per-transmission cost with usage, creating a positive feedback loop: heavily used nodes incur progressively higher transmission costs, accelerating depletion. This nonlinear coupling between usage and cost is the primary mechanism by which naive policies (Fixed, unconstrained RL) cause premature node death.

3.5. Communication Model

The system operates in a centralised, time-slotted architecture. At each hourly time step, the following sequence occurs: (i) all alive nodes transmit a status report (battery level, alive flag, local event detection) to the central controller; (ii) the controller aggregates observations and runs the scheduling policy; (iii) the controller issues a single command designating one node for event transmission or issuing a no-transmit directive; (iv) the designated node (if any) transmits event data to the controller. This hold-last-command protocol means that inter-step events are reported at the next time step, introducing a maximum one-hour reporting latency.

4. Reinforcement Learning Formulation

4.1. Markov Decision Process

The transmission scheduling problem is formulated as a Markov Decision Process (MDP) S , A , P , R , γ with the following components.
State space  S : At each time step t, the state s t encodes, for each cluster c { 1 , , C } : the number of alive nodes n c alive , the mean battery level E ¯ c , and the distance from the cluster centroid to the current event d c . An event flag f t { 0 , 1 } indicates whether a detectable event is active. The full observation is
s t = n 1 alive , E ¯ 1 , d 1 , , n C alive , E ¯ C , d C , f t R 3 C + 1
Action space  A : The agent selects an action a t { 0 , 1 , , N 1 , N } where actions 0 to N 1 correspond to selecting a specific node for transmission and action N denotes no transmission.
Reward function: The reward r t is designed to balance event reporting, energy conservation, and network longevity:
r t = r tx + r energy + r alive + r balance + r death
where
  • r tx = + 2.0 for correct transmission (event active, node detects it), 2.0 for incorrect transmission (wrong node or dead node), 1.0 for missed event;
  • r energy = 0.2 per transmission (energy conservation penalty) and 2 × 10 6 · d i (distance penalty favouring proximal nodes);
  • r alive = + 0.01 per alive node (survival incentive);
  • r balance = 0.03 · σ ( E t ) where σ ( E t ) is the standard deviation of battery levels (load-balancing incentive);
  • r death = 5.0 per dead cluster, 20.0 if all nodes are dead.
Rationale for r tx asymmetry. The penalty for an incorrect transmission ( 2.0 ) is set higher in magnitude than the penalty for a missed event ( 1.0 ) for two reasons grounded in the energy-constrained nature of seismic monitoring. First, an incorrect transmission has a compounding cost: it consumes the per-transmission energy budget ( C tx , base = 3.0  J plus distance-dependent overhead) without yielding a useful detection, and that expenditure directly reduces the number of future events that can be reported before node depletion. A missed event, by contrast, is a one-shot loss; it does not deplete the network’s capacity to report subsequent events. Second, the action mask already disallows transmission from dead or out-of-range nodes, so an incorrect transmission represents a policy error (selecting a non-detecting node despite eligibility), which the agent should learn to avoid more aggressively than a conservation choice (no-transmit during an event). The asymmetric weighting therefore treats false positives—which both waste energy and undermine deployment-side trust in the alarm channel—as the more harmful failure mode in expectation. We verified empirically that swapping the weights ( r tx , wrong = 1.0 , r miss = 2.0 ) produced 8.4 pp lower survival at p event = 0.9 over a 25-episode pilot. The 25-episode pilot was used given the directional rather than headline nature of this check; the observed 8.4 pp gap substantially exceeds the between-block standard deviation reported in Section 7.9 ( 1.86  pp), confirming statistical significance and supporting the conclusion that the chosen asymmetry is well aligned with the longevity objective.
Action masking: At each step, the environment computes a boolean mask m t { 0 , 1 } | A | where m t , a = 1 if and only if the corresponding node is alive, has sufficient battery for transmission, and (if an event is active) is within the detection radius. The no-transmit action is always valid. The policy is restricted to select from { a : m t , a = 1 } .

4.2. PPO with Generalised Advantage Estimation

The policy is parameterised as a neural network π θ ( a | s ) with a shared feature extraction backbone feeding separate actor and critic heads. The actor outputs a categorical distribution over masked actions; the critic estimates the state value V ϕ ( s ) .
Training follows Proximal Policy Optimisation [11] with the clipped surrogate objective:
L PPO ( θ ) = E t min r t ( θ ) A ^ t , clip ( r t ( θ ) , 1 ϵ , 1 + ϵ ) A ^ t
where r t ( θ ) = π θ ( a t | s t ) / π θ old ( a t | s t ) is the probability ratio and ϵ = 0.2 is the clipping parameter. Advantages are estimated via GAE [12]:
A ^ t = l = 0 ( γ λ ) l δ t + l , δ t = r t + γ V ( s t + 1 ) V ( s t )
with discount factor γ = 0.99 and trace decay λ = 0.95 . The critic is trained to minimise the mean squared value prediction error.

5. Proposed Architecture

5.1. Network Architecture and Action Masking

The policy network consists of a three-layer MLP with hidden dimensions [256, 256, 128] and ReLU activations. The shared layers extract features from the padded observation vector s t R 3 C max + 1 , where C max is the maximum supported cluster count. The actor head produces logits over N max + 1 actions; invalid actions receive logits via the action mask, ensuring zero probability under the softmax. The critic head outputs a scalar value estimate.
To support variable network sizes with a single trained model, observations are zero-padded to length 3 C max + 1 and the action mask disables indices corresponding to non-existent or dead nodes. This fixed-size wrapper approach avoids retraining when the topology changes within the bounds defined by C max and M max .

5.2. System Pipeline

The overall system pipeline operates in two phases: offline training and online inference. During training, the PPO agent interacts with the simulation environment, receiving observations and action masks, and optimising the clipped surrogate objective. During inference, the trained policy processes real-time observations and produces candidate actions, which are then filtered by the guard layer before execution. Figure 1 illustrates the inference-time data flow.
The critic head is used only during training for advantage estimation and is discarded at inference. The guard layer introduces no additional learnable parameters; its behaviour is fully determined by the configuration parameters τ batt , α , and  ρ t .

5.3. Guard-Layer Safety Filter

The guard layer operates as a runtime constraint enforcement mechanism that intercepts actions produced by the learned policy and applies hard constraints before execution. It does not modify the policy parameters and can be enabled or disabled independently.

5.3.1. Safety Score Computation

Given a state s t with battery levels E t and the set of eligible nodes E t (alive, sufficient battery, event-detectable), the guard layer computes a safety score for each eligible node i:
score i = E i , t τ batt + α E ¯ t E i , t
where τ batt is the battery safety threshold, E ¯ t = mean ( E t ) is the network-wide mean battery, and  α = 0.2 is the load-balancing weight. The first term ensures that selected nodes maintain sufficient battery headroom; the second term penalises nodes with above-average battery, thereby preferring under-utilised nodes and promoting equity.

5.3.2. Action Filtering with Relaxation

At runtime, the guard layer receives the policy’s action a t . If  a t selects a node i E t with E i , t > ρ t · τ batt , the action passes through unchanged. Otherwise, the guard substitutes the highest-scoring eligible node:
a safe = a t if feasible , E i , t > ρ t τ batt arg   max j E t   score j otherwise
The relaxation factor ρ t = 1 + 0.5 · min ( t / T relax , 1 ) allows the threshold to increase from τ batt to 1.5 · τ batt over the first T relax = 100,000 training steps. This progressive tightening permits early exploration while enforcing increasingly strict safety as training converges.

6. Experimental Setup

6.1. Simulation Environment

Experiments were conducted using a custom discrete-event simulator implementing the system model of Section 3. Nodes were deployed over a 1000 km × 1000 km region with coordinates sampled from pre-defined cluster configurations. Three network scales were evaluated: N { 10 , 15 , 30 } nodes. Each simulation episode spanned T max = 8760 steps (one year at hourly resolution). Solar irradiance data were obtained from the PVGIS database [28] for a representative location in northern India (approximately 28.5° N, 77.2° E), a region characterised by high peak summer irradiance, pronounced seasonal variation between monsoon and dry-season periods, and topographic shading effects that reduce effective harvest below clear-sky estimates. This geographic specificity ensures that the energy model reflects realistic harvesting conditions for the target deployment environment rather than idealised or European solar profiles.
The battery capacity was set to E max = 15,000  J. The guard-layer threshold was τ batt = 500  J (3.3% of capacity), with relaxation over T relax = 100,000 steps and load-balancing weight α = 0.2 . The detection radius was R detect = 250  km.

6.2. Training Configuration

The PPO agent was trained using Stable Baselines3 [33] with action masking from SB3-Contrib. Training hyperparameters are summarised in Table 2.

6.3. Guard-Layer Hyperparameter Selection

Guard-layer hyperparameters were selected on the basis of energy-balance analysis and a sensitivity sweep performed on the 30-node configuration (Table 3, Figure 2). The battery safety threshold τ batt = 500  J (3.3% of E max = 15,000 J) was chosen to guarantee that a selected node retains sufficient charge for at least one full transmission cycle ( C tx , base = 3.0  J) plus continuous idle consumption for approximately 55 h at worst-case conditions ( C idle = 9.0  J/h). This provides a meaningful buffer against consecutive low-irradiance periods, which occur during monsoon season in the target deployment region.
Load-balancing weight α (rank-invariance and empirical sweep). The scoring function (10) admits the algebraic rewrite
score ( i ) = ( 1 α ) E i + α E ¯ eligible τ batt ,
from which it follows that for any pair of eligible candidates ( i , j ) and any α [ 0 , 1 ) ,
score ( i ) score ( j ) = ( 1 α ) ( E i E j ) ,
i.e., the ranking induced by the score is invariant under α over the entire interval [ 0 , 1 ) and is determined solely by battery level. Only at α = 1 does the score collapse to E ¯ eligible τ batt , a constant across candidates, at which point tie-breaking falls to enumeration order and the load-balancing benefit of the guard degrades.
The manuscript’s choice α = 0.2 therefore lies in a continuous robust interval of width 1.0 , not a tuned operating point. Empirically and analytically, RL+Guard performance is insensitive to α throughout this interval; the same survival rate (65.47%), transmission success (99.34%), and load-balance coefficient of variation (1.332) are obtained for α { 0.0 , 0.1 , 0.2 , 0.3 , 0.5 } .
Relaxation schedule T relax . The relaxation schedule T relax = 100,000 steps corresponds approximately to 20% of total training, permitting unrestricted exploration during the early phase before safety enforcement progressively tightens. This schedule prevents the guard layer from over-constraining early policy exploration, during which high-variance action selection is beneficial for escaping local optima. The schedule was selected on the basis of training-time PPO loss-curve convergence: episode reward plateaued by approximately step 80,000 and remained stable thereafter, indicating that the relaxation horizon spans the curriculum-driven exploration phase. Because  T relax is a training-time parameter, evaluating its sensitivity at inference time would require independent retraining runs at each candidate value, which is computationally prohibitive within the present study; the convergence evidence above and the analytical rank-invariance result for α together provide the primary robustness justification for the chosen guard-layer parameterisation.

6.4. Baseline Policies

Four heuristic baselines were implemented for comparison, representing canonical scheduling strategies from the WSN literature:
Fixed Policy. The Fixed policy always designates a single predetermined node (cluster 0, node 0) for all transmissions. Before transmitting, it verifies that the node is alive, has battery exceeding the transmission cost, and (if an event is active) can detect the event. If any condition fails, no transmission occurs. This policy represents the worst-case scenario for load distribution—all energy expenditure is concentrated on one node, while remaining nodes idle indefinitely. It serves as a lower bound on scheduling intelligence.
Round-Robin Policy. The Round-Robin policy maintains a global pointer that cycles through all N nodes. At each step, the pointer advances to the next eligible node (alive, sufficient battery, event-detectable). If the current pointer node is ineligible, the policy scans forward until an eligible node is found or all nodes have been checked, at which point a no-transmit action is returned. This policy provides perfect temporal fairness—over a long horizon, each node receives an equal share of transmissions—but ignores spatial proximity and energy heterogeneity.
Closest Policy. The Closest policy selects the eligible node with the smallest Euclidean distance to the current event epicentre, breaking ties by selecting the node with the highest remaining battery. By minimising transmission distance, this policy implicitly minimises per-transmission energy cost (assuming distance-dependent attenuation), making it the most energy-efficient on a per-event basis. However, geographic clustering of events causes repeated selection of spatially proximal nodes, leading to localised depletion.
RL (PPO). The unconstrained PPO agent operates without the guard-layer filter, serving as an ablation baseline that isolates the contribution of safety constraints. This policy learns from the same reward signal and state representation as the guard-enhanced variant but may select actions that violate battery or load-balance constraints.
The fifth policy, RL+Guard, applies the guard-layer filter of Algorithm 1 to the PPO agent’s output. Any action that violates battery or balance constraints is replaced by the highest-scoring eligible node under the guard’s scoring function (10).
Algorithm 1 Guard-Layer Action Filtering
  1:
Input: state s t , policy action a t , threshold τ batt , relaxation ρ t
  2:
Output: safe action a safe
  3:
E t  GetEligibleNodes ( s t )
  4:
if  E t =  then
  5:
    return no-transmit
  6:
end if
  7:
i * Decode(at) {extract node index}
  8:
τ * ρ t · τ batt {relaxed threshold}
  9:
if  i * E t  and  E i * , t > τ *  then
10:
    return  a t {action is safe}
11:
end if
12:
for each i E t  do
13:
     score i ( E i , t τ * ) + α ( E ¯ t E i , t )
14:
end for
15:
return Encode ( arg max i score i )

6.5. Evaluation Metrics

Performance was assessed across five primary metrics: (i) mean total reward, the cumulative reward over the episode averaged across runs; (ii) transmission success rate, the percentage of detectable events for which a valid transmission was executed; (iii) node survival rate, the percentage of nodes alive at episode end; (iv) load-balance Jain’s Fairness Index (JFI) [34], computed over final battery levels across all nodes, where values approaching 1.0 indicate perfectly equitable energy distribution and values approaching 1 / N indicate all load concentrated on a single node; and (v) mean final battery, the average residual battery across all nodes at episode termination. Each configuration was evaluated over 50 independent episodes with different random seeds. To verify that 50 episodes is statistically sufficient—and to characterise long-tail edge-case behaviour from prolonged low-irradiance periods or unfavourable event clustering—the headline 30-node, p event = 0.9 configuration was additionally replicated across four disjoint 50-episode seed blocks (base seeds 42, 142, 242, 342; 200 effective episodes total). The between-block variance and pooled per-episode percentile distribution are reported in Section 7.9. Performance statistics are reported as mean ± 95% confidence interval computed using the t-distribution with 49 degrees of freedom ( t 0.025 , 49 = 2.010 ), providing statistically robust estimates with N = 50 replications per primary configuration. Training was conducted on a single NVIDIA GPU. Each training run comprised 2,000,000 total environment interaction steps across episodes of T max = 8760 steps each (approximately 228 full episodes), requiring approximately 4–6 h per configuration. Training employed curriculum learning with domain randomisation: the agent was exposed to varied network configurations and stochastic conditions during training to improve generalisation. A linear learning rate decay from 5 × 10 4 to 5 × 10 6 was applied over the training horizon. Convergence was verified by monitoring the mean episode reward over the final 500,000 training steps: the policy’s rolling reward plateaued with less than 2% variation over the last 25% of training, indicating stable convergence. Inference-time overhead from the guard layer is negligible: the scoring function (10) involves O ( N ) arithmetic operations per step and adds less than 0.1 ms per decision, suitable for real-time hourly scheduling on resource-constrained hardware.

7. Results

7.1. Aggregate Performance Across Scales

Table 4 presents the aggregate performance metrics for all five policies across three network scales. Several findings emerge.
At 10 nodes, the network is sufficiently over-provisioned that all policies except Fixed achieve 100% transmission success and 100% survival. The RL+Guard policy achieves a reward comparable to that of Closest ( 11 , 278 ± 68 vs. 11 , 770 ± 70 ; confidence intervals overlap) and all policies achieve near-perfect load fairness (JFI ≈ 1.000). The small network permits balanced load distribution regardless of scheduling strategy.
At 15 nodes, energy constraints become binding, and node deaths occur across all policies. Closest achieves the highest reward ( 5961 ± 328 ) but only 34.80% survival, indicating aggressive exploitation of proximal nodes. The RL+Guard policy achieves the highest survival at 62.93% while maintaining the highest transmission success (97.58%), though confidence intervals partially overlap across policies due to the inherent variability at this scale. Load fairness degrades substantially: JFI drops to 0.186–0.596 across policies, with the Fixed policy maintaining the highest fairness (JFI = 0.596) and RL (PPO) the lowest (JFI = 0.186).
At 30 nodes, the RL+Guard policy achieves the highest transmission success (99.46%) and the highest survival rate (66.47%). While Closest achieves a higher cumulative reward ( 8944 ± 190 vs. 8393 ± 186 ), this comes at the cost of only 42.00% survival—a 58.3% reduction relative to RL+Guard. Fairness analysis reveals that RL+Guard achieves JFI = 0.362 compared to Closest (JFI = 0.381), indicating a slight fairness trade-off: the guard layer’s battery-threshold scoring concentrates load on high-battery nodes rather than distributing it with maximal statistical equity, but this trade-off yields substantially higher survival. Crucially, RL+Guard surpasses unconstrained PPO in reward ( 8393 ± 186 vs. 7532 ± 233 , + 11.4 % ), transmission success ( 99.46 % vs. 98.67 % ), and survival ( 66.47 % vs. 57.60 % , + 15.4 % ), demonstrating that the guard layer simultaneously improves performance and safety at this scale.

7.2. Cumulative Reward Comparison

Figure 3 shows the mean total reward for each policy across all three scales. Fixed consistently achieves negative reward due to missed events (single-node coverage is limited to R detect = 250 km). At 10 nodes, RL+Guard and Closest are statistically indistinguishable. At 30 nodes, a clear hierarchy emerges: Closest > RL+Guard > Round-Robin > RL (PPO) ≫ Fixed. Notably, the guard-enhanced policy outperforms unconstrained PPO by + 11.4 % at 30 nodes and by + 52.5 % at 15 nodes ( 4634 ± 273 vs. 3038 ± 324 ), confirming that the guard layer’s battery-preserving substitutions consistently improve long-run reward by preventing the early node deaths that reduce coverage and incur death penalties.

7.3. Transmission Success and Node Survival

Figure 4 and Figure 5 present survival rate and transmission success, respectively. The survival results reveal a critical insight: policies optimising for reward (Closest) achieve the lowest survival at 30 nodes (42.00%), while the safety-constrained RL+Guard achieves the highest (66.47%). This +58.3% improvement in survival comes with a modest −6.2% reduction in cumulative reward relative to Closest.
The transmission success results at 30 nodes are particularly noteworthy: RL+Guard achieves 99.46%, the highest among all policies, despite its energy-preserving node selection. This is because the guard layer’s load-balancing prevents the premature node deaths that would otherwise reduce the available detection coverage in later time steps.

7.4. Load-Fairness Analysis

Figure 6 quantifies load distribution via Jain’s Fairness Index (JFI) of final battery levels, where values approaching 1.0 indicate perfectly equitable distribution. At 10 nodes, all policies achieve near-perfect fairness (JFI ≈ 1.000), as the small network permits balanced load distribution. At 15 and 30 nodes, a fairness hierarchy emerges: the Fixed policy consistently achieves the highest JFI (0.596 at 15 nodes, 0.669 at 30 nodes) because its single-node transmission strategy leaves all other nodes at full charge. This apparent fairness arises because all non-transmitting nodes retain full charge, producing a bimodal battery distribution (one depleted node, N 1 full nodes) that Jain’s index interprets as high equity; this apparent equity does not reflect genuine load balance and should be interpreted alongside the survival metric. Among active scheduling policies, unconstrained RL (PPO) achieves the lowest fairness (JFI = 0.186 at both 15 and 30 nodes), reflecting severely unbalanced scheduling that concentrates transmission load on a small subset of nodes. The guard layer dramatically improves this: RL+Guard achieves JFI = 0.398 and 0.362 at 15 and 30 nodes, respectively—a +89.5% improvement over unconstrained PPO at 30 nodes. Among heuristic policies, Closest achieves JFI = 0.394 and 0.381 at 15 and 30 nodes, respectively, comparable to RL+Guard. The guard layer’s battery-threshold scoring selects substituted nodes for high battery headroom, which redistributes load away from depleted nodes and substantially improves distributional equity relative to the unconstrained policy. While RL+Guard achieves slightly lower JFI than Closest at 30 nodes (0.362 vs. 0.381), this modest fairness difference is offset by RL+Guard’s substantially higher survival (66.47% vs. 42.00%), confirming that the guard layer prioritises survival-critical battery headroom over maximal statistical equity—a deliberate design choice because premature depletion carries catastrophic penalties that strict equidistribution does not prevent.

7.5. Percentage Improvements at 30 Nodes

Table 5 quantifies the relative performance of RL+Guard against each baseline at 30 nodes. Compared to the highest-reward baseline (Closest), RL+Guard achieves a 58.3% improvement in node survival at the cost of a 6.2% reduction in cumulative reward. Crucially, RL+Guard outperforms the unconstrained PPO on all metrics simultaneously: reward ( 8393 ± 186 vs. 7532 ± 233 , + 11.4 % ), transmission success ( 99.46 % vs. 98.67 % , + 0.8 pp), and survival ( 66.47 % vs. 57.60 % , + 15.4 % ). The guard-enhanced policy demonstrates that hard safety constraints, properly aligned with Equation (10), can simultaneously improve both performance and safety without the expected reward–safety trade-off.
These results demonstrate that the guard-enhanced policy does not merely trade reward for safety—it simultaneously improves reward, transmission success, and node survival relative to unconstrained RL at all scales where energy constraints are binding. The improvement over Closest in survival ( + 58.3 % ) with only modest reward reduction ( 6.2 % ) further confirms that the guard layer navigates the reward–survival Pareto frontier toward substantially more sustainable operating points.

7.6. Temporal Dynamics at 30 Nodes

Figure 7 shows the evolution of alive nodes over the full simulation year for the 30-node network. The temporal dynamics reveal several phases of network operation.
Phase 1: Initialisation (steps 0–500). All policies maintain the full complement of 30 nodes. Solar harvesting charges batteries from initial conditions, and energy reserves are sufficient to absorb transmission costs without depletion. The Fixed policy’s designated node begins accumulating transmission stress but has not yet reached the depletion threshold.
Phase 2: Onset of depletion (steps 500–3000). The Fixed policy’s designated node fails around step 1500–2000, reducing its alive count. Other policies remain largely intact, though battery variance begins to increase as transmission load concentrates on subsets of nodes.
Phase 3: Steady-state decline (steps 3000–8760). Round-Robin and Closest show progressive node failure as the cumulative effect of transmission stress and seasonal irradiance variation depletes the most active nodes. The RL+Guard policy maintains higher alive counts throughout this phase, with approximately 19.9 nodes surviving to the end compared to 15.9 for Round-Robin and 12.6 for Closest. The guard layer’s load-balancing mechanism delays the onset of cascading failures by redistributing transmission load before nodes reach critical depletion.
Figure 8 presents the mean battery level over time. All policies show an initial charging phase (steps 0–500) as solar harvesting fills batteries from initial conditions. Subsequently, transmission load causes divergence. Closest maintains the highest mean battery because it minimises per-transmission energy, but this masks severe depletion of frequently selected nodes. RL+Guard shows lower mean battery due to more distributed usage but achieves better survival through equity.

7.7. Reward–Survival Trade-Off

Figure 9 maps each policy–scale combination onto the reward–survival plane. The ideal operating point is the upper right (high reward, high survival). At 10 nodes, all policies except Fixed cluster in this region. At 15 and 30 nodes, a Pareto frontier emerges: Closest occupies the high-reward/low-survival corner, while RL+Guard shifts toward balanced performance. No policy dominates across both axes at the larger scales, confirming that the reward–survival trade-off is fundamental and that the guard layer provides a mechanism for navigating this trade-off in favour of sustained operation.
Quantitative trade-off and consequences for seismic imaging. The trade-off Closest makes against RL+Guard is asymmetric: at the headline 30-node, p event = 0.9 configuration, Closest extracts + 577 additional reward ( + 6.4 % relative) by repeatedly selecting nodes near event epicentres, while RL+Guard achieves + 24.44 pp higher node survival ( 66.22 % vs. 41.78 % , a relative gain of + 58.5 % in surviving nodes). The same asymmetry holds across the entire p event sweep reported in Table 6. In a seismic monitoring context, this asymmetry is decisive. Source-localisation accuracy depends on the geometric distribution of surviving stations relative to the epicentre, not on the cumulative reward integrated over the deployment lifetime: a 24.4 pp survival gain corresponds to approximately 7.3 additional surviving nodes in the 30-node network, and these additional surviving nodes are distributed across the deployment area rather than concentrated near historically active epicentres. The Closest policy’s surviving nodes are precisely those that have been most heavily used—typically the spatial cluster nearest the dominant event source—so its surviving station geometry is biased and provides poor azimuthal coverage of less active regions. RL+Guard’s distributed survival pattern therefore preserves the triangulation geometry required for reliable epicentre estimation, which is the primary deployment objective for the wireless seismic network. The reward shortfall ( 6.4 % ) translates into approximately 577 reward units of unrealised reward over the deployment year, whereas the survival gain ( + 24.4 pp at 30 nodes) translates into approximately seven additional surviving nodes distributed across the deployment area at the end of the simulated year—preserving spatial coverage in regions that the Closest-node-only baseline has effectively abandoned through localised central-node depletion.

7.8. Sensitivity to Event Probability

To identify whether a break-even point exists at which the proposed framework loses its advantage over simpler heuristics in low-activity scenarios, we conducted a six-point sensitivity sweep over p event { 0.1 , 0.2 , 0.3 , 0.5 , 0.7 , 0.9 } on the 30-node configuration, holding all other parameters fixed (Table 6, Figure 10).
No break-even in the realistic range. The survival advantage of RL+Guard over the strongest heuristic baseline (Round-Robin) grows monotonically with p event : from + 1.45 pp at p = 0.1 to + 13.78 pp at p = 0.9 . Against Closest—the heuristic that maximises per-event reward—the gap is wider, from + 2.67 pp to + 24.44 pp over the same range. At no tested operating point does any heuristic overtake RL+Guard on survival, and RL+Guard achieves the highest transmission success at every point. The complexity of an RL-based controller is therefore justified across the entire realistic range of seismic event rates relevant to the target deployment.
Reward-vs.-survival trade-off across event rates. At low event rates ( p = 0.1 ), RL+Guard leads on both reward and survival; the surplus reward over the best heuristic is + 115 reward units. We note that this absolute surplus is large in proportional terms only because the sparse-event regime produces small total rewards across all four learning and heuristic policies (the four cluster in the narrow range 43–218); the practical advantage at low event rates is best characterised through the survival and transmission-success columns, where RL+Guard remains the highest-performing policy. At higher event rates ( p 0.2 ), Closest extracts marginally more reward by aggressively re-using nodes near event epicentres but at the cost of accelerated central-node depletion and corresponding survival loss. At p = 0.9 , the reward gap is 577 in absolute terms ( 6.4 % relative), while the survival gap is + 24.44 pp—a 58.5 % relative improvement in surviving nodes for a 6.4 % reward concession. For seismic monitoring, where post-event localisation requires triangulation across a spatially diverse set of surviving stations, this trade is overwhelmingly correct: a network with 66.22 % surviving nodes preserves the geometric coverage needed for accurate epicentre estimation, whereas 41.78 % surviving nodes from the Closest heuristic concentrates surviving stations near the most active epicentres and degrades coverage of the remainder of the monitored region.

7.9. Statistical Robustness and Long-Tail Behaviour

To address concerns about the statistical sufficiency of N = 50 episodes and to characterise long-tail edge cases (e.g., prolonged low-irradiance periods or unfavourable event clustering), the headline 30-node, p event = 0.9 configuration was replicated across four disjoint 50-episode seed blocks (base seeds 42, 142, 242, 342; 200 effective episodes). Two distinct statistical questions are addressed: (i) whether 50 episodes resolve the headline mean to acceptable precision (between-block variance) and (ii) whether worst-case episodes invert the policy ranking (per-episode percentile distribution). These results are reported in Table 7 and Figure 11, respectively.
Between-block variance is small. The between-block standard deviation for RL+Guard is 1.86 pp on survival rate and 0.08 pp on transmission success—both smaller than the within-block 95% confidence interval at N = 50 . This confirms that 50 episodes is sufficient to resolve the headline survival rate to within approximately ±2 pp at the between-seed level, with transmission success effectively deterministic at this sample size.
Block 342 captures a long-tail stress draw. Block 342 produced systematically lower survival across all policies (RL+Guard 3.55 pp, Round-Robin 4.96 pp, Closest 2.22 pp, RL (PPO) 2.55 pp relative to the mean of the other three blocks), consistent with an unfavourable PVGIS solar-yield realisation in this seed’s simulated year. This is precisely the long-tail edge case the methodology must remain robust to. Two observations are decisive: (i) the policy ranking is preserved under stress—RL+Guard remains the highest-survival policy at every block; and (ii) RL+Guard’s relative advantage is preserved or amplified under stress, depending on the comparison baseline: the gap to Round-Robin widens to + 15.67 pp in block 342 versus a mean of + 14.0 pp across the other three blocks (an amplification of + 1.7 pp), while the gap to Closest is essentially unchanged at + 22.47 pp in block 342 versus + 24.0 pp on average (a marginal narrowing of 1.5 pp). Both differences are well within the between-block standard deviation of 1.86 pp, confirming that the policy ranking and approximate advantage magnitudes are stable under unfavourable solar-yield draws.
Pooled per-episode percentiles, n = 200 . Across all 200 episodes, the per-episode 5th-percentile of final-alive-nodes for RL+Guard is 18 nodes; the 95th-percentile is 22; the worst-case episode ended with 16 alive nodes. The corresponding worst case for Closest is 11; for Round-Robin is 12. RL+Guard’s 5th-percentile (18) exceeds the 95th-percentile of every other policy. The transmission-success worst-case for RL+Guard is 98.2 % , exceeding the median of every baseline. Combined with the small between-block variance, these results establish that 50 episodes resolve the headline mean adequately and that long-tail edge cases do not invert the policy ranking.

8. Discussion

8.1. Scale-Dependent Performance Regimes

The results reveal qualitatively distinct performance regimes as network scale increases. At 10 nodes, the network is over-provisioned: energy harvesting exceeds consumption for all policies, and the scheduling decision is largely irrelevant. At 15 nodes, energy constraints become binding, and learned policies begin to outperform heuristics in survival. At 30 nodes, the combinatorial complexity of scheduling decisions exceeds the capacity of simple heuristics, and learned policies with safety constraints dominate.
This scale dependence has practical implications for seismic monitoring network design in the Indian context. For small sub-regional deployments covering a single seismic zone cluster ( N 10 nodes), heuristic Round-Robin scheduling is near-optimal and computationally trivial to implement on resource-constrained hardware. For regional networks covering multiple zone clusters ( 15 N 30 nodes), RL-based scheduling provides measurable survival benefit that compounds over year-long deployment horizons—particularly during monsoon periods when consecutive low-irradiance days stress energy reserves. For national-scale deployments ( N > 30 ), hierarchical architectures in which local controllers manage cluster-level scheduling and a coordination layer handles inter-cluster resource balancing represent the natural architectural extension and are identified as a direction for future work.

8.2. Reward Function Design Rationale

The reward function (7) combines five components whose relative weights were informed by energy-balance analysis rather than arbitrary tuning. The transmission reward magnitude (±2.0) was set to exceed the maximum per-step energy penalty ( 0.2 per transmission plus distance term), ensuring that correct event detection is always incentivised over energy hoarding. The survival incentive ( + 0.01 per alive node) was deliberately set to one-tenth of the energy penalty to avoid overwhelming the transmission objective with a trivial “do nothing” optimum. The death penalties ( 5.0 per dead cluster, 20.0 all-dead) are calibrated to be catastrophic relative to single-step rewards, incentivising the agent to treat node survival as a hard constraint. The load-balance penalty weight ( 0.03 σ ) was set at the same order of magnitude as the per-step survival reward, producing a soft preference for equity without dominating the transmission objective. These relationships reflect deliberate design choices rather than unconstrained hyperparameter search; a full ablation study across reward component subsets is identified as a direction for future work.

8.3. Guard-Layer Effectiveness

The guard-layer implementation demonstrates strictly superior performance over unconstrained PPO at all scales where energy constraints are binding. At 30 nodes, RL+Guard achieves cumulative reward of 8393 ± 186 compared to 7532 ± 233 for unconstrained PPO ( + 11.4 % ), along with higher transmission success ( 99.46 % vs. 98.67 % ) and substantially higher survival ( 66.47 % vs. 57.60 % , + 15.4 % ). The guard layer’s benefit arises because it prevents the policy from selecting nodes with insufficient battery headroom: such selections would incur elevated transmission costs from the stress-degradation model (2) and accelerate depletion, triggering death penalties that suppress long-run cumulative reward. By substituting higher-battery nodes, the guard layer reduces total stress accumulation across the episode and extends network lifetime—both of which increase cumulative reward. The guard layer also substantially improves load fairness: RL+Guard achieves JFI = 0.362 at 30 nodes compared to 0.191 for unconstrained PPO (+89.5%) because the battery-threshold scoring redistributes load away from depleted nodes and onto under-utilised high-battery nodes, producing a more equitable battery distribution than the unconstrained policy’s unbalanced scheduling. The guard layer is transparent when the policy independently selects safe actions, and its overhead is limited to O ( N ) arithmetic operations per step.
The relaxation schedule ensures that the guard layer does not overly constrain early training, when exploration is beneficial. As training progresses and the policy converges, the increasing threshold enforces progressively stricter safety, aligning constraint enforcement with policy maturity.

8.4. Temporal vs. Spatial Fairness

The results confirm a well-documented tension in WSN scheduling between temporal fairness and spatial energy equity. At 15 and 30 nodes, policies with higher transmission rates do not achieve proportionally higher survival. Round-Robin achieves 98.74% transmission success at 30 nodes but only 52.87% survival, while RL+Guard achieves 99.46% transmission and 66.47% survival. This divergence arises because uniform transmission scheduling (Round-Robin) distributes energy load evenly in a temporal sense but does not account for heterogeneous energy harvesting conditions across node locations. Nodes with lower solar irradiance (due to orientation, shading, or latitude effects) bear an equal transmission burden but cannot sustain it, leading to premature failure. The guard layer’s scoring function (10) addresses this well-known spatial fairness challenge by incorporating node-specific battery levels into the selection criterion, enabling spatially adaptive load distribution that accounts for heterogeneous energy harvesting conditions—an adaptation that static temporal-fairness policies such as Round-Robin cannot perform.

8.5. Action Masking and Sample Efficiency

The integration of action masking with PPO is essential for training in the WSN scheduling domain, where the fraction of valid actions changes dynamically. At 30 nodes with N + 1 = 31 possible actions, the number of eligible nodes at any given step may range from 0 (all dead) to 30 (all alive with sufficient battery and event-detectable). Without masking, the agent would waste significant exploration on infeasible actions, leading to slow convergence and suboptimal policies. The mask ensures that the policy gradient is computed only over valid actions, improving sample efficiency and reducing training time. The guard layer provides a complementary mechanism: while the mask prevents infeasible actions during training, the guard prevents unsafe actions during deployment, where “unsafe” is defined by the battery and balance constraints that go beyond simple feasibility.

8.6. Comparison with Constrained Policy Optimisation and Lagrangian-Relaxed PPO

The proposed guard-layer approach differs fundamentally from constrained policy optimisation (CPO) [23] and from Lagrangian-relaxed PPO [35] and other on-policy safe RL methods. CPO modifies the policy update rule to satisfy cost constraints in expectation, which requires defining differentiable cost functions and may lead to overly conservative policies when constraints are tight. Lagrangian-relaxed PPO replaces hard constraints with soft penalties weighted by an adaptive multiplier λ cos t that is tuned to satisfy the constraint in expectation; this preserves PPO’s optimisation machinery but inherits two weaknesses for the present setting. First, soft constraints are only satisfied in expectation over the discounted-return distribution. For safety-critical battery thresholds, expectation-level satisfaction is insufficient: a policy that violates the threshold on 5% of timesteps but compensates by over-conserving on the remaining 95% will be reported as feasible by the Lagrangian objective even though it produces real depletion events. The guard layer enforces the threshold per-step by construction, providing the worst-case guarantee that expectation-level methods cannot. Second, the multiplier λ cos t requires careful tuning and is known to oscillate during training [36], producing instability in the constraint-satisfaction trajectory; our guard layer has no learned multiplier and therefore no oscillation by construction.
The guard layer instead operates as a post hoc filter: the policy is trained with an unconstrained (but reward-shaped) objective, and constraint enforcement is applied at inference time. This decoupled design has three advantages. First, the policy can explore freely during training, potentially discovering high-reward trajectories that a constrained update would suppress. Second, the guard layer’s parameters ( τ batt , α , ρ t ) can be tuned independently of training, enabling domain experts to adjust safety margins without retraining—in contrast, modifying the cost function in CPO or the constraint threshold in Lagrangian-PPO requires a full training run. Third, the guard layer provides interpretable intervention: each substituted action has a clear reason (low battery, imbalanced load), supporting auditability in safety-critical deployments.
Empirical comparison against Lagrangian-relaxed PPO. To complement the theoretical comparison, we trained a Lagrangian-relaxed Maskable PPO baseline at the headline configuration (30 nodes, p event = 0.9 , 2,000,000 training steps, identical hyperparameters to the unconstrained PPO baseline). The cost signal is the fraction of alive nodes whose battery is below the same safety threshold ( τ batt ) used by the guard layer; the multiplier λ cos t is updated every 2048 environment steps via gradient ascent on the constraint violation, capped at λ max = 50 to prevent divergence. Evaluation followed the same 50-episode protocol as the headline experiments. Results are summarised in Table 8.
The empirical comparison directly supports the theoretical argument above. Lagrangian-PPO improves on unconstrained PPO in cumulative reward (7836 vs. 7532, + 4.0 % ) but produces no meaningful improvement on network survival ( 56.60 % vs. 57.60 % for unconstrained PPO, a difference of less than the between-block standard deviation reported in Section 7.9). RL+Guard beats Lagrangian-PPO on every metric of practical interest: survival is 9.87 pp higher, transmission success 1.31 pp higher, reward + 557 , and load-balance CV substantially better ( 1.33 vs. 1.94 ). The per-step guard layer therefore provides the network-survival benefit that the expectation-level Lagrangian formulation does not deliver at this constraint threshold, even though both formulations target the same battery-threshold quantity.
Multiplier trajectory and contextualisation of Stooke et al. [36]. For full transparency, the multiplier λ cos t in our run did not exhibit the oscillation behaviour widely reported in the literature; instead, it decayed monotonically to zero as the policy learned to satisfy the chosen cost threshold ( 0.05 , i.e., at most 5 % of alive nodes below the safety threshold in expectation). The empirically observed average cost over training was 0.009 , well below the threshold, which explains the monotonic decay: the constraint was never tight enough to push the multiplier upward. We note that Stooke et al. specifically discuss oscillation regimes that arise under tighter constraint thresholds where the policy cannot trivially satisfy them; our setting represents the regime where the Lagrangian formulation behaves stably but also where its expectation-level guarantee is empirically insufficient to match per-step enforcement. Tighter cost thresholds would likely produce the oscillation pattern in the cited literature; the guard layer’s per-step filter avoids this entire class of training-stability concerns by construction.

8.7. Simulation Fidelity

The simulation environment was implemented as a custom OpenAI Gymnasium environment rather than using an existing WSN simulation platform (e.g., ns-3, Castalia) to enable tight integration with the RL training loop and to model domain-specific energy dynamics at the required fidelity. This choice avoids the abstraction overhead of general-purpose network simulators while preserving the ability to directly control the energy model at the component level. The simulator’s energy model is grounded in empirically validated components. Solar irradiance profiles are sourced from the European Commission’s PVGIS database [28], which provides satellite-derived hourly irradiance estimates validated against ground-station measurements across India [37]. The battery model incorporates voltage sag under load, Peukert-effect capacity reduction at high discharge rates, and temperature-dependent efficiency—phenomena well characterised in the lithium-ion battery literature [38]. Transmission energy costs follow a distance-dependent attenuation model consistent with the log-distance path loss model widely adopted in WSN simulation studies [39]. Idle and state-report energy costs ( C idle = 9.0 J/h, C sr = 0.3 J/step) are within the range reported for commercial WSN hardware such as the MICAz and TelosB platforms [40]. While the simulator does not model channel fading, packet collisions, or multi-hop routing, these simplifications are standard in the WSN scheduling literature [6,7] and isolate the scheduling decision from communication-layer effects, enabling a controlled comparison of policy performance. Validation of the integrated model against physical testbed data is identified as a priority for future work.

8.8. Scalability Considerations

Asymptotic complexity of the guard layer. At each scheduling step, the guard performs three operations whose costs are well defined in N. (i) The eligibility test (alive, sufficient battery, in detection radius) is an O ( N ) pass over the node set. (ii) The scoring function (10) requires computing the mean battery E ¯ across eligible nodes ( O ( N ) ) and then a single scoring pass ( O ( N ) ) to identify the highest-scoring eligible node, where Jain’s-fairness-equivalent load metrics are evaluated incrementally rather than over a quadratic candidate set. (iii) The action substitution and masking step is O ( 1 ) given the score. The aggregate per-step complexity is therefore Θ ( N ) in the number of nodes. There is no candidate-action enumeration that scales super-linearly in N: the guard scores each eligible node once and selects the maximum, rather than evaluating all N 2 or N 2 pairs. Concretely, at N = 30 , the per-step overhead measured on a single CPU thread was <0.1 ms; extrapolating linearly under the Θ ( N ) bound, N = 100 corresponds to approximately 0.3 ms and N = 500 to approximately 1.5 ms per decision, both negligible relative to the hourly scheduling cadence and within the budget of resource-constrained embedded controllers (e.g., ARM Cortex-M class).
Empirical validation regime and projected scaling beyond it. The fixed-size wrapper supports up to C max × M max nodes with zero-padding and masking, avoiding retraining for smaller networks. As N increases beyond approximately 100 nodes, the policy network rather than the guard becomes the bottleneck: the action space grows large and sparse, the per-cluster observation aggregation loses granularity, and PPO sample efficiency degrades. For deployments at N = 300 500 , a hierarchical architecture is the natural extension: local controllers manage intra-cluster scheduling using the guard-enhanced PPO framework described here, while a coordination layer handles inter-cluster resource balancing and event routing. Each local controller operates over a tractable cluster size ( N local 30 ), preserving the regime where the proposed approach has been empirically validated, while the coordination layer can employ lightweight heuristics or a separate RL agent trained at the cluster-scheduling granularity. We note that simulating N = 100 or N = 500 end to end at the temporal resolution used here ( T max = 8760 steps, full-year horizon, PVGIS-driven solar yield) was computationally prohibitive within the present study’s resources; we therefore present the Θ ( N ) analysis and the hierarchical-architecture sketch in lieu of empirical large-scale validation and identify it as the priority for follow-on work.

9. Limitations and Future Work

The current work has several limitations that motivate future investigation. The system model assumes single-hop communication; extending the guard layer to multi-hop relay scoring is non-trivial. The guard layer provides empirical constraint enforcement but lacks formal safety guarantees—future work should explore control barrier functions or Lyapunov-based certificates for provable battery bounds. The evaluation considers up to 30 nodes; scalability to 100+ node networks requires hierarchical or federated learning architectures. While the sensitivity analysis (Section 7.8) demonstrates robustness across event rates, real seismic activity is spatially correlated and temporally non-stationary, and the PVGIS solar profile covers a single geographic location. The radio model assumes ideal communication without channel fading or packet loss. Finally, the current evaluation uses a single spatial topology per scale; whether policies generalise across topologies without retraining remains an open question that domain randomisation during training could address.

10. Conclusions

This paper has presented a safety-constrained reinforcement learning framework for transmission scheduling in energy-harvesting seismic wireless sensor networks. The integration of PPO with action masking and a runtime guard-layer safety filter demonstrates that hard safety constraints can be imposed on learned policies with minimal performance degradation. At 30 nodes, the guard-enhanced PPO achieves 99.46% transmission success and 66.47% node survival—a 58.3% improvement in survival over the highest-reward heuristic baseline (Closest)—while simultaneously outperforming unconstrained PPO on reward, transmission success, and survival. The guard layer’s scoring mechanism, combining battery headroom with load equity, provides an interpretable and configurable safety interface between the learned policy and the physical network. Sensitivity analysis across event rates confirms that these benefits are robust to operating conditions and not artefacts of the extreme stress-test scenario. The experimental analysis across three network scales identifies distinct operational regimes and quantifies the reward–survival trade-off that is fundamental to energy-constrained WSN scheduling. These results establish that safety-constrained RL is a viable and effective approach for long-term autonomous management of critical monitoring infrastructure.

Author Contributions

Conceptualization, I.N.; methodology, I.N.; software, I.N.; validation, I.N.; formal analysis, I.N.; investigation, I.N.; writing—original draft preparation, I.N.; writing—review and editing, I.N. and A.R.; visualization, I.N.; supervision, A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The simulation code and trained model weights used in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WSNWireless Sensor Network
RLReinforcement Learning
PPOProximal Policy Optimisation
GAEGeneralised Advantage Estimation
JFIJain’s Fairness Index
PVGISPhotovoltaic Geographical Information System
MDPMarkov Decision Process
CPOConstrained Policy Optimisation

References

  1. Akyildiz, I.F.; Su, W.; Sankarasubramaniam, Y.; Cayirci, E. Wireless sensor networks: A survey. Comput. Netw. 2002, 38, 393–422. [Google Scholar] [CrossRef]
  2. Culler, D.; Estrin, D.; Srivastava, M. Overview of sensor networks. IEEE Comput. 2004, 37, 41–49. [Google Scholar] [CrossRef]
  3. Jornet-Monteverde, J.A.; Galiana-Merino, J.J.; Soler-Llorens, J.L. Design and implementation of a wireless sensor network for seismic monitoring of buildings. Sensors 2021, 21, 3875. [Google Scholar] [CrossRef]
  4. Sethi, P.; Sarangi, S.R. Internet of Things: Architectures, protocols, and applications. J. Elect. Comput. Eng. 2017, 2017, 9324035. [Google Scholar] [CrossRef]
  5. Hahne, E.L. Round-robin scheduling for max-min fairness in data networks. IEEE J. Sel. Areas Commun. 1991, 9, 1024–1039. [Google Scholar] [CrossRef]
  6. Heinzelman, W.R.; Chandrakasan, A.; Balakrishnan, H. Energy-efficient communication protocol for wireless microsensor networks. In Proceedings of the 33rd HICSS, Maui, HI, USA, 4–7 January 2000; pp. 3005–3014. [Google Scholar]
  7. Lindsey, S.; Raghavendra, C.S. PEGASIS: Power-efficient gathering in sensor information systems. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 9–16 March 2002; pp. 1125–1130. [Google Scholar]
  8. Li, Y. Deep reinforcement learning: An overview. arXiv 2017, arXiv:1701.07274. [Google Scholar]
  9. Guo, W.; Yan, C.; Lu, T. Optimizing the lifetime of wireless sensor networks via reinforcement-learning-based routing. Int. J. Distrib. Sens. Netw. 2019, 15, 1550147719833541. [Google Scholar] [CrossRef]
  10. Ye, W.; Heidemann, J.; Estrin, D. An energy-efficient MAC protocol for wireless sensor networks. In Proceedings of the IEEE INFOCOM, New York, NY, USA, 23–27 June 2002; Volume 3, pp. 1567–1576. [Google Scholar]
  11. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  12. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  13. Huang, S.; Ontañón, S. A closer look at invalid action masking in policy gradient algorithms. In Proceedings of the FLAIRS, Jensen Beach, FL, USA, 15–18 May 2022. [Google Scholar]
  14. Guo, H.; Wu, R.; Qi, B.; Xu, C. Deep-Q-Networks-based adaptive dual-mode energy-efficient routing in rechargeable wireless sensor networks. IEEE Sens. J. 2022, 22, 9956–9966. [Google Scholar] [CrossRef]
  15. Barat, A.; Prabuchandran, K.J.; Bhatnagar, S. Energy management in a cooperative energy harvesting wireless sensor network. IEEE Commun. Lett. 2024, 28, 243–247. [Google Scholar] [CrossRef]
  16. Al-Abiad, M.A.; Khan, A.Z.; De Domenico, M. Deep reinforcement learning-based energy consumption optimization for peer-to-peer communication in wireless sensor networks. Sensors 2024, 24, 1632. [Google Scholar]
  17. Alagha, A.; Singh, S.; Mizouni, R.; Otrok, S.S.; Mourad, A. Data-driven dynamic active node selection for event localization in IoT applications—A case study of radiation monitoring. IEEE Access 2019, 7, 16168–16183. [Google Scholar] [CrossRef]
  18. Wang, Y.; Xie, S.; Liu, X. Multi-agent deep reinforcement learning for task offloading in group M2M communications. Sensors 2021, 21, 3226. [Google Scholar]
  19. Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.-C.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174. [Google Scholar] [CrossRef]
  20. Liu, Z.; Chen, Y.; Chen, B.; Zhu, S.; Yang, K. Multi-agent reinforcement learning for long-term network resource management in IoT. IEEE Trans. Mob. Comput. 2023, 22, 3264–3279. [Google Scholar]
  21. García, J.; Fernández, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, 16, 1437–1480. [Google Scholar]
  22. Altman, E. Constrained Markov Decision Processes; Chapman & Hall/CRC: Boca Raton, FL, USA, 1999. [Google Scholar]
  23. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the ICML, Sydney, Australia, 6–11 August 2017; pp. 22–31. [Google Scholar]
  24. Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; Topcu, U. Safe reinforcement learning via shielding. In Proceedings of the AAAI, New Orleans, LA, USA, 2–7 February 2018; pp. 2669–2678. [Google Scholar]
  25. Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Knoll, A. A review of safe reinforcement learning: Methods, theory and applications. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11216–11235. [Google Scholar] [CrossRef] [PubMed]
  26. Xiao, W.; Molnar, T.G.; Orosz, G.; Ames, A.D. Barriernet: A safety-guaranteed layer for neural networks. Proc. Mach. Learn. Res. 2023, 168, 1–11. [Google Scholar]
  27. India Meteorological Department. Seismological Bulletin; National Seismological Network: New Delhi, India, 2023. Available online: https://seismo.gov.in/ (accessed on 15 March 2026).
  28. Huld, T.; Müller, R.; Gambardella, A. A new solar radiation database for estimating PV performance in Europe and Africa. Sol. Energy 2012, 86, 1803–1815. [Google Scholar] [CrossRef]
  29. Lei, J.; Yates, R.; Greenstein, L. A generic model for optimizing single-hop transmission policy of replenishable sensors. IEEE Trans. Wirel. Commun. 2009, 8, 547–556. [Google Scholar] [CrossRef]
  30. Allen, R.V. Automatic earthquake recognition and timing from single traces. Bull. Seismol. Soc. Am. 1978, 68, 1521–1532. [Google Scholar] [CrossRef]
  31. Trnkoczy, A. Understanding and parameter setting of STA/LTA trigger algorithm. In New Manual of Seismological Observatory Practice 2 (NMSOP-2); IASPEI: Kjeller, Norway, 2012; pp. 1–20. [Google Scholar]
  32. Mousavi, S.M.; Ellsworth, W.L.; Zhu, W.; Chuang, L.Y.; Beroza, G.C. Earthquake transformer—An attentive deep-learning model for simultaneous earthquake detection and phase picking. Nat. Commun. 2020, 11, 3952. [Google Scholar] [CrossRef]
  33. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  34. Jain, R.; Chiu, D.-M.; Hawe, W.R. A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems; Technical Report TR-301; DEC Research: Palo Alto, CA USA, 1984. [Google Scholar]
  35. Ray, A.; Achiam, J.; Amodei, D. Benchmarking Safe Exploration in Deep Reinforcement Learning; OpenAI Technical Report; OpenAI: San Francisco, CA, USA, 2019. [Google Scholar]
  36. Stooke, A.; Achiam, J.; Abbeel, P. Responsive safety in reinforcement learning by PID Lagrangian methods. In Proceedings of the ICML, Online, 13–18 July 2020; pp. 9133–9143. [Google Scholar]
  37. Urraca, M.; Huld, T.; Gracia-Amillo, A.; Martinez-de-Pison, F.J.; Kaspar, F.; Sanz-Garcia, A. Evaluation of global horizontal irradiance estimates from ERA5 and COSMO-REA6 reanalyses using ground and satellite-based data. Sol. Energy 2019, 188, 1049–1062. [Google Scholar] [CrossRef]
  38. Chen, M.; Rincón-Mora, G.A. Accurate electrical battery model capable of predicting runtime and I–V performance. IEEE Trans. Energy Convers. 2006, 21, 504–511. [Google Scholar] [CrossRef]
  39. Rappaport, T.S. Wireless Communications: Principles and Practice, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2002. [Google Scholar]
  40. Anastasi, G.; Conti, M.; Di Francesco, M.; Passarella, A. Energy conservation in wireless sensor networks: A survey. Ad Hoc Netw. 2009, 7, 537–568. [Google Scholar] [CrossRef]
Figure 1. Inference-time pipeline: the environment emits observations and action masks; the PPO policy produces a candidate action; the guard layer filters it; the safe action is executed.
Figure 1. Inference-time pipeline: the environment emits observations and action masks; the PPO policy produces a candidate action; the guard layer filters it; the safe action is executed.
Sensors 26 03542 g001
Figure 2. Sensitivity of RL+Guard performance to the load-balancing weight α . The shaded green band marks the rank-invariant interval α [ 0 , 1 ) predicted by Equation (12). The manuscript’s choice α = 0.2 sits in the interior of this interval. Only at α = 1.0 , where the score function degenerates to a constant, is a behavioural difference observed.
Figure 2. Sensitivity of RL+Guard performance to the load-balancing weight α . The shaded green band marks the rank-invariant interval α [ 0 , 1 ) predicted by Equation (12). The manuscript’s choice α = 0.2 sits in the interior of this interval. Only at α = 1.0 , where the score function degenerates to a constant, is a behavioural difference observed.
Sensors 26 03542 g002
Figure 3. Mean total reward across network scales. Error bars indicate 95% confidence intervals ( N = 50 episodes).
Figure 3. Mean total reward across network scales. Error bars indicate 95% confidence intervals ( N = 50 episodes).
Sensors 26 03542 g003
Figure 4. Node survival rate (%) across network scales.
Figure 4. Node survival rate (%) across network scales.
Sensors 26 03542 g004
Figure 5. Transmission success rate (%) across network scales.
Figure 5. Transmission success rate (%) across network scales.
Sensors 26 03542 g005
Figure 6. Jain’s Fairness Index (JFI) across network scales. Higher JFI indicates more equitable battery distribution ( JFI = 1.0 : perfect fairness).
Figure 6. Jain’s Fairness Index (JFI) across network scales. Higher JFI indicates more equitable battery distribution ( JFI = 1.0 : perfect fairness).
Sensors 26 03542 g006
Figure 7. Mean alive nodes over simulation time (30-node network, 8760 hourly steps).
Figure 7. Mean alive nodes over simulation time (30-node network, 8760 hourly steps).
Sensors 26 03542 g007
Figure 8. Mean battery level over simulation time (30-node network).
Figure 8. Mean battery level over simulation time (30-node network).
Sensors 26 03542 g008
Figure 9. Reward–survival scatter across all policy–scale combinations. Marker shape indicates network scale (circle: 10, square: 15, triangle: 30 nodes).
Figure 9. Reward–survival scatter across all policy–scale combinations. Marker shape indicates network scale (circle: 10, square: 15, triangle: 30 nodes).
Sensors 26 03542 g009
Figure 10. RL+Guard maintains the highest survival rate at every p event (left) while sacrificing ≤6.4% on reward relative to Closest (right). The survival advantage grows monotonically from + 1.45 pp at p = 0.1 to + 13.78 pp at p = 0.9 over the strongest non-trivial heuristic (Round-Robin); no break-even point at which heuristics surpass RL+Guard exists in the realistic range.
Figure 10. RL+Guard maintains the highest survival rate at every p event (left) while sacrificing ≤6.4% on reward relative to Closest (right). The survival advantage grows monotonically from + 1.45 pp at p = 0.1 to + 13.78 pp at p = 0.9 over the strongest non-trivial heuristic (Round-Robin); no break-even point at which heuristics surpass RL+Guard exists in the realistic range.
Sensors 26 03542 g010
Figure 11. (Left) Between-block survival-rate means with one-stdev error bars and min/max range across four 50-episode blocks. (Right) Pooled per-episode distribution of final alive nodes across all 200 episodes; the box-and-whisker shows median, IQR, and outliers. RL+Guard’s full distribution lies above every other policy’s full distribution, with a 5th-percentile of 18 alive nodes that exceeds the 95th-percentile of every baseline.
Figure 11. (Left) Between-block survival-rate means with one-stdev error bars and min/max range across four 50-episode blocks. (Right) Pooled per-episode distribution of final alive nodes across all 200 episodes; the box-and-whisker shows median, IQR, and outliers. RL+Guard’s full distribution lies above every other policy’s full distribution, with a 5th-percentile of 18 alive nodes that exceeds the 95th-percentile of every baseline.
Sensors 26 03542 g011
Table 1. Energy model parameters.
Table 1. Energy model parameters.
ParameterSymbolValue
Battery capacity E max 15,000 J
Idle power P idle 2.5 mW
Idle cost/step C idle 9.0 J
Base TX cost C tx , base 3.0 J
TX stress growth γ stress 0.0015
Aging rate γ age 5 × 10 5
Panel efficiency η 12%
Panel areaA1.2  cm 2
State-report scale10% of TX cost
Peak harvest H peak 49.8 J/h
Detection radius R detect 250 km
Event probability p event 0.9
Table 2. PPO training hyperparameters.
Table 2. PPO training hyperparameters.
ParameterValue
Learning rate α 5 × 10 4 5 × 10 6 (linear decay)
Network architectureMLP [256, 256]
Batch size256
Steps per rollout8760 (one full episode)
Epochs per update10
Entropy coefficient0.02
Value function coefficient0.5
Clipping parameter ϵ 0.2
Discount factor γ 0.99
GAE parameter λ 0.95
Table 3. Guard-layer load-balancing weight sensitivity at 30 nodes (RL+Guard, p event = 0.9 , 25 episodes). Metrics are bit-identical for α [ 0 , 0.5 ] in agreement with the rank-invariance derivation (12); the degeneracy at α = 1.0 is empirically confirmed. Bold values denote the configuration adopted in the manuscript ( α = 0.2 ).
Table 3. Guard-layer load-balancing weight sensitivity at 30 nodes (RL+Guard, p event = 0.9 , 25 episodes). Metrics are bit-identical for α [ 0 , 0.5 ] in agreement with the rank-invariance derivation (12); the degeneracy at α = 1.0 is empirically confirmed. Bold values denote the configuration adopted in the manuscript ( α = 0.2 ).
α Survival (%)Tx (%)RewardLoad-Balance CVFinal Alive (Mean)
0.065.4799.348280.221.33219.64
0.165.4799.348280.221.33219.64
0.265.4799.348280.221.33219.64
0.365.4799.348280.221.33219.64
0.565.4799.348280.221.33219.64
1.062.8099.268258.211.68718.84
Table 4. Performance metrics across network scales (mean ± 95% CI, t-distribution, d f = 49 , N = 50 episodes). JFI values are point estimates averaged over episodes; per-episode CIs require full per-node battery distributions not retained in the summary pipeline.
Table 4. Performance metrics across network scales (mean ± 95% CI, t-distribution, d f = 49 , N = 50 episodes). JFI values are point estimates averaged over episodes; per-episode CIs require full per-node battery distributions not retained in the summary pipeline.
ScalePolicyRewardTx Succ. (%)Surv. (%)Avg Batt. (%)Load JFI
10 nodesFixed 1766 ± 70 23.54100.0093.781.000
Round-Robin 10 , 849 ± 67 100.00100.0093.771.000
Closest 11 , 770 ± 70 100.00100.0093.771.000
RL (PPO) 9937 ± 65 100.00100.0093.761.000
RL+Guard 11 , 278 ± 68 100.00100.0093.771.000
15 nodesFixed 2961 ± 247 31.8866.5375.250.596
Round-Robin 5013 ± 290 95.7036.5389.560.399
Closest 5961 ± 328 95.0834.8091.200.394
RL (PPO) 3038 ± 324 96.7454.1378.900.186
RL+Guard 4634 ± 273 97.5862.9373.690.398
30 nodesFixed 4059 ± 191 22.7169.2075.620.669
Round-Robin 7865 ± 181 98.7452.8775.510.425
Closest 8944 ± 190 97.3342.0084.490.381
RL (PPO) 7532 ± 233 98.6757.6078.400.191
RL+Guard 8393 ± 186 99.4666.4774.310.362
Fixed JFI reflects a bimodal battery distribution (one depleted node, remainder full) rather than genuine load equity; see Section 7.4.
Table 5. Percentage improvement of RL+Guard over baselines (30 Nodes).
Table 5. Percentage improvement of RL+Guard over baselines (30 Nodes).
Baseline Δ Reward Δ Tx Δ Surv. Δ JFI
Fixed +306.8%+338.5%−3.9%
Round-Rob.+6.7%+0.7%+25.7%−14.8%
Closest−6.2%+2.2%+58.3%−5.0%
RL (PPO)+11.4%+0.8%+15.4%+89.5%
Fixed JFI reflects bimodal battery distribution (one depleted node, remainder full) rather than genuine load equity; see Section 7.4.
Table 6. Sensitivity to event probability at 30 nodes (30 episodes per cell). Surv. gap columns give RL+Guard’s survival advantage in percentage points over the corresponding heuristic. The advantage grows monotonically with p event ; no break-even exists in the realistic range. Italic rows are metric-panel subheadings; bold values indicate the proposed RL+Guard method in the survival-rate panel and the highest value in each row in the reward panel.
Table 6. Sensitivity to event probability at 30 nodes (30 episodes per cell). Surv. gap columns give RL+Guard’s survival advantage in percentage points over the corresponding heuristic. The advantage grows monotonically with p event ; no break-even exists in the realistic range. Italic rows are metric-panel subheadings; bold values indicate the proposed RL+Guard method in the survival-rate panel and the highest value in each row in the reward panel.
p event FixedRRClosestRLRL+GuardGap vs. RRGap vs. Cls
Survival rate (%)
0.168.8967.3366.1167.3368.78 + 1.45 + 2.67
0.268.8965.5662.4465.1168.56 + 3.00 + 6.12
0.368.8964.7859.1162.1168.22 + 3.44 + 9.11
0.568.8959.6750.3359.6767.44 + 7.77 + 17.11
0.768.8956.0045.3358.3367.33 + 11.33 + 22.00
0.968.8952.4441.7857.2266.22 + 13.78 + 24.44
Mean total reward
0.1−11625810243218
0.5−26363952439435294360
0.9−40807895903474728457
Transmission success (%)
0.122.8499.5899.4099.4399.64
0.522.5799.2598.3398.8799.55
0.922.7698.8497.2098.6399.44
Table 7. Between-block variance for RL+Guard at 30 nodes, p event = 0.9 . Each block is an independent 50-episode draw; stdev is computed across the four block-level means.
Table 7. Between-block variance for RL+Guard at 30 nodes, p event = 0.9 . Each block is an independent 50-episode draw; stdev is computed across the four block-level means.
MetricBlock 42Block 142Block 242Block 342MeanStdev
Survival rate (%)66.4765.2766.3362.4765.131.86
Mean total reward83938338822678978214222
Tx success (%)99.4699.3899.3599.2699.360.08
Final alive nodes19.9419.5819.9018.7419.540.56
Table 8. Empirical comparison at 30 nodes, p event = 0.9 , N = 50 episodes. RL+Guard wraps the unconstrained PPO model with the runtime guard; Lagrangian-PPO replaces the guard with a soft constraint enforced by an adaptive multiplier. The Lagrangian formulation extracts marginally higher reward than unconstrained PPO but does not improve network survival; RL+Guard wins on every metric. Bold values indicate the best (winning) value in each column.
Table 8. Empirical comparison at 30 nodes, p event = 0.9 , N = 50 episodes. RL+Guard wraps the unconstrained PPO model with the runtime guard; Lagrangian-PPO replaces the guard with a soft constraint enforced by an adaptive multiplier. The Lagrangian formulation extracts marginally higher reward than unconstrained PPO but does not improve network survival; RL+Guard wins on every metric. Bold values indicate the best (winning) value in each column.
MethodSurvival (%)Tx Success (%)RewardLoad-Balance CV
RL (unconstrained PPO)57.6098.6775322.06
Lagrangian-relaxed PPO56.6098.1578361.94
RL+Guard (proposed)66.4799.4683931.33
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nazamdin, I.; Reid, A. Safety-Constrained Reinforcement Learning for Energy-Aware Transmission Scheduling in Seismic Wireless Sensor Networks. Sensors 2026, 26, 3542. https://doi.org/10.3390/s26113542

AMA Style

Nazamdin I, Reid A. Safety-Constrained Reinforcement Learning for Energy-Aware Transmission Scheduling in Seismic Wireless Sensor Networks. Sensors. 2026; 26(11):3542. https://doi.org/10.3390/s26113542

Chicago/Turabian Style

Nazamdin, Isa, and Alistair Reid. 2026. "Safety-Constrained Reinforcement Learning for Energy-Aware Transmission Scheduling in Seismic Wireless Sensor Networks" Sensors 26, no. 11: 3542. https://doi.org/10.3390/s26113542

APA Style

Nazamdin, I., & Reid, A. (2026). Safety-Constrained Reinforcement Learning for Energy-Aware Transmission Scheduling in Seismic Wireless Sensor Networks. Sensors, 26(11), 3542. https://doi.org/10.3390/s26113542

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop