3.7. Proposed Method
Step 1: Initial Modeling of Nodes, Network Topology, and Physical Resources.
We consider a static WSN deployed in a rectangular region $\mathcal{A} \subset \mathbb{R}^2$ with a single sink at a known position $p_{\text{sink}}$ (the extension to multiple sinks is straightforward). Nodes are placed i.i.d. uniformly at random in $\mathcal{A}$ and have limited battery, compute, and memory resources.
Node positions and initial energy. Let $N$ be the number of sensor nodes. The position of node $i$ is $p_i = (x_i, y_i) \in \mathcal{A}$, and all nodes start with the same initial energy:
$$E_i(0) = E_0, \qquad i = 1, \ldots, N.$$
We assume ideal localization (each node knows $p_i$ and the sink broadcasts $p_{\text{sink}}$); if localization noise is present, it can be modeled as $\tilde{p}_i = p_i + \eta_i$ with $\eta_i \sim \mathcal{N}(0, \sigma_{\text{loc}}^2 I)$.
Distance and neighborhood. The Euclidean distance between nodes $i$ and $j$ is
$$d_{ij} = \lVert p_i - p_j \rVert_2 = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}.$$
A directed link $(i, j)$ is feasible if $d_{ij} \le R_c$, where $R_c$ is the PHY-layer communication range.
Radio/energy model. We adopt the first-order radio model with a free-space/multipath crossover distance $d_0 = \sqrt{\varepsilon_{\text{fs}} / \varepsilon_{\text{mp}}}$. Let $k$ be the packet size in bits, $E_{\text{elec}}$ the electronics energy per bit (TX/RX), and $\varepsilon_{\text{fs}}$ and $\varepsilon_{\text{mp}}$ the amplifier coefficients for free-space ($d < d_0$) and multipath ($d \ge d_0$) propagation, respectively. The per-packet transmit and receive energies are:
$$E_{\text{TX}}(k, d) = \begin{cases} k E_{\text{elec}} + k\, \varepsilon_{\text{fs}}\, d^2, & d < d_0, \\ k E_{\text{elec}} + k\, \varepsilon_{\text{mp}}\, d^4, & d \ge d_0, \end{cases} \qquad E_{\text{RX}}(k) = k E_{\text{elec}}.$$
If on-node processing is non-negligible, we include a per-bit CPU cost $e_{\text{cpu}}$ so that the processing energy for $k$ bits is $E_{\text{proc}}(k) = k\, e_{\text{cpu}}$.
Battery dynamics. Let $\mathcal{T}_i(t)$ and $\mathcal{R}_i(t)$ denote the sets of packets transmitted and received by node $i$ during slot/epoch $t$, and let $d_m$ be the TX distance of packet $m \in \mathcal{T}_i(t)$. The battery recursion is
$$E_i(t+1) = E_i(t) - \sum_{m \in \mathcal{T}_i(t)} E_{\text{TX}}(k_m, d_m) - \sum_{m \in \mathcal{R}_i(t)} E_{\text{RX}}(k_m) - \sum_{m \in \mathcal{T}_i(t) \cup \mathcal{R}_i(t)} E_{\text{proc}}(k_m),$$
with node $i$ considered dead when $E_i(t) \le 0$. Control traffic (e.g., HELLO/ACK, FRL uploads) is accounted for by using Equations (6) and (7) with the appropriate control packet sizes.
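For illustration, the following Python sketch implements the first-order radio model and one step of the battery debit described above. The numeric constants (50 nJ/bit electronics energy, 10 pJ/bit/m² and 0.0013 pJ/bit/m⁴ amplifier coefficients) are commonly used defaults for this radio model and stand in for the simulation settings of this work; function names are illustrative.

```python
import math

# Illustrative radio parameters (common defaults for the first-order model;
# the actual simulation settings of this work may differ).
E_ELEC = 50e-9        # electronics energy per bit [J/bit]
EPS_FS = 10e-12       # free-space amplifier coefficient [J/bit/m^2]
EPS_MP = 0.0013e-12   # multipath amplifier coefficient [J/bit/m^4]
D0 = math.sqrt(EPS_FS / EPS_MP)   # crossover distance [m]

def e_tx(k_bits: int, d: float) -> float:
    """Per-packet transmit energy E_TX(k, d) of the first-order radio model."""
    if d < D0:
        return k_bits * E_ELEC + k_bits * EPS_FS * d ** 2
    return k_bits * E_ELEC + k_bits * EPS_MP * d ** 4

def e_rx(k_bits: int) -> float:
    """Per-packet receive energy E_RX(k)."""
    return k_bits * E_ELEC

def debit_battery(E_i: float, tx_packets, rx_packets, e_cpu: float = 0.0) -> float:
    """One step of the battery recursion: subtract TX, RX and (optional) CPU costs.
    tx_packets: iterable of (k_bits, distance); rx_packets: iterable of k_bits.
    CPU cost is charged on transmitted bits here for simplicity."""
    spent = sum(e_tx(k, d) for k, d in tx_packets)
    spent += sum(e_rx(k) for k in rx_packets)
    spent += sum(k for k, _ in tx_packets) * e_cpu
    return max(0.0, E_i - spent)   # node counts as dead once the residual reaches 0

# Example: one 4000-bit packet forwarded over 60 m, two packets received.
E_next = debit_battery(0.5, tx_packets=[(4000, 60.0)], rx_packets=[4000, 4000])
```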
 Step 2: Neighbor Discovery and Initialization of Communication Metrics.
Each node discovers neighbors and initializes link metrics (distance, reliability, trust primitives) using a lightweight HELLO/ACK procedure over IEEE 802.15.4 Carrier-Sense Multiple Access with Collision Avoidance (CSMA/CA).
MAC/PHY and timing. Time is partitioned into discovery periods of length $T_d$. To reduce collisions, each node transmits one HELLO per period at time $t_p + J_i$, where $J_i \sim \mathcal{U}(0, J_{\max})$ is a random jitter ($J_{\max} \ll T_d$). We use unslotted CSMA/CA with maximum re-transmissions $R_{\max}$ for data/ACK frames. Let $k_H$ and $k_A$ be the HELLO and ACK payload sizes (bits).
Control frame formats. Node $i$ broadcasts a HELLO frame carrying its identifier, position $p_i$, residual energy $E_i(t)$, and a local timestamp $\tau_i$. Upon reception, neighbor $j$ unicasts an ACK frame echoing $\tau_i$ together with its own identifier, position, and residual energy.
Neighbor set and distances. The (directed) neighbor set of $i$ at period $t$ is $\mathcal{N}_i(t) = \{\, j : d_{ij} \le R_c \text{ and a valid ACK from } j \text{ was received} \,\}$. Distances follow Step 1: $d_{ij} = \lVert p_i - p_j \rVert_2$ for all $j \in \mathcal{N}_i(t)$.
Link reliability estimator. Let $a_{ij}(t) \in \{0, 1\}$ indicate whether a valid ACK from $j$ to $i$ was received in period $t$. We maintain both a sliding-window estimator over the last $W$ periods and an EWMA:
$$\hat{r}^{\,\text{win}}_{ij}(t) = \frac{1}{W} \sum_{u = t - W + 1}^{t} a_{ij}(u), \qquad \hat{r}_{ij}(t) = (1 - \lambda)\, \hat{r}_{ij}(t-1) + \lambda\, a_{ij}(t),$$
with $\lambda \in (0, 1]$. A link is considered usable if $\hat{r}_{ij}(t) \ge r_{\min}$. Neighbors that fail to respond for $K$ consecutive periods are pruned.
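A minimal sketch of the two reliability estimators follows; the window length, smoothing factor, and usability threshold are illustrative placeholders rather than the values used in our simulations.

```python
from collections import deque

class LinkReliability:
    """Tracks ACK indicators a_ij(t) for one neighbor with a sliding window and an EWMA."""

    def __init__(self, window: int = 8, lam: float = 0.2, r_min: float = 0.6):
        self.window = deque(maxlen=window)  # last W binary ACK indicators
        self.lam = lam                      # EWMA smoothing factor lambda
        self.r_min = r_min                  # usability threshold
        self.ewma = 1.0                     # optimistic start before any evidence

    def update(self, ack_received: bool) -> None:
        a = 1.0 if ack_received else 0.0
        self.window.append(a)
        self.ewma = (1.0 - self.lam) * self.ewma + self.lam * a

    @property
    def windowed(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 1.0

    def usable(self) -> bool:
        return self.ewma >= self.r_min

# Example: three discovery periods with one missed ACK.
link = LinkReliability()
for ack in (True, False, True):
    link.update(ack)
print(link.windowed, round(link.ewma, 3), link.usable())
```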
RTT and RSSI sampling. For each ACK, node $i$ estimates the round-trip time $\mathrm{RTT}_{ij}(t)$ and records the receiver-side RSSI (from the radio), yielding the time series $\{\mathrm{RTT}_{ij}(t)\}$ and $\{\mathrm{RSSI}_{ij}(t)\}$.
Lightweight anomaly detector (for trust primitives). Per link $(i, j)$, define residuals for a generic scalar metric $x_{ij}(t)$, where $x_{ij}(t) \in \{\mathrm{RTT}_{ij}(t), \mathrm{RSSI}_{ij}(t)\}$. Maintain an EWMA mean/variance:
$$\mu_{ij}(t) = (1 - \beta)\, \mu_{ij}(t-1) + \beta\, x_{ij}(t), \qquad \sigma^2_{ij}(t) = (1 - \beta)\, \sigma^2_{ij}(t-1) + \beta\, \big(x_{ij}(t) - \mu_{ij}(t)\big)^2,$$
with $\beta \in (0, 1]$. The standardized residual is $z_{ij}(t) = \big(x_{ij}(t) - \mu_{ij}(t)\big) / \big(\sigma_{ij}(t) + \epsilon\big)$ ($\epsilon$ small). We flag an anomaly on $(i, j)$ at time $t$ if either $|z_{ij}(t)| > z_{\text{th}}$ or the windowed loss $1 - \hat{r}^{\,\text{win}}_{ij}(t)$ exceeds $\ell_{\text{th}}$, for at least $M$ consecutive periods. The Boolean flag $F_{ij}(t)$ is passed to Step 4's trust update.
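The per-link detector can be sketched as follows; the EWMA factor, z-score and loss thresholds, and the persistence count are assumed placeholder values.

```python
class LinkAnomalyDetector:
    """EWMA mean/variance tracker with standardized-residual and loss-rate tests."""

    def __init__(self, beta: float = 0.1, z_th: float = 3.0,
                 loss_th: float = 0.5, persist: int = 3, eps: float = 1e-6):
        self.beta, self.z_th, self.loss_th = beta, z_th, loss_th
        self.persist, self.eps = persist, eps
        self.mu, self.var = None, 1.0
        self.streak = 0                    # consecutive suspicious periods

    def update(self, x: float, windowed_loss: float) -> bool:
        """x: scalar metric sample (e.g., RTT or RSSI); windowed_loss: 1 - r_win."""
        if self.mu is None:                # initialize on the first sample
            self.mu = x
        z = (x - self.mu) / (self.var ** 0.5 + self.eps)
        # EWMA updates of mean and variance
        self.mu = (1 - self.beta) * self.mu + self.beta * x
        self.var = (1 - self.beta) * self.var + self.beta * (x - self.mu) ** 2
        suspicious = abs(z) > self.z_th or windowed_loss > self.loss_th
        self.streak = self.streak + 1 if suspicious else 0
        return self.streak >= self.persist   # anomaly flag F_ij(t)
```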
Control plane energy accounting. Let $p^{\text{succ}}_{ij}(t)$ denote the one-shot success probability for transmissions toward $j$ in period $t$ (estimated by $\hat{r}_{ij}(t)$ or PHY metrics). The expected number of transmissions (including CSMA/CA retries, truncated at $R_{\max}$) is
$$\bar{n}_{ij}(t) = \sum_{m=1}^{R_{\max}} m\, p^{\text{succ}}_{ij}(t) \big(1 - p^{\text{succ}}_{ij}(t)\big)^{m-1} + R_{\max} \big(1 - p^{\text{succ}}_{ij}(t)\big)^{R_{\max}}.$$
Assuming a fixed broadcast range $R_c$ for HELLOs, the expected control energy of node $i$ per discovery period is
$$E^{\text{ctrl}}_i(t) = E_{\text{TX}}(k_H, R_c) + \sum_{j \in \mathcal{N}_i(t)} \Big[ E_{\text{RX}}(k_H) + \bar{n}_{ij}(t)\, \big( E_{\text{TX}}(k_A, d_{ij}) + E_{\text{RX}}(k_A) \big) \Big],$$
which is debited in the battery recursion (Equation (9)).
Initialization of trust primitives. Behavioral trust is initialized neutrally: $T^{\text{beh}}_{ij}(0) = 0.5$ for all $j \in \mathcal{N}_i(0)$, and will be updated in Step 4 using $F_{ij}(t)$, $\hat{r}_{ij}(t)$, and the RTT residuals. Algorithm 6 presents the neighbor discovery and metric initialization procedure.
Step 3: Defining the Reinforcement Learning Space for Network Nodes (Modeling RL Agents).
Each node $i$ is modeled as an RL agent that selects its next-hop parent based on local state features. To ensure tractability, continuous state variables are discretized into finite bins, enabling tabular Q-learning.
State space. The state of agent $i$ at time $t$ is represented as:
$$s_i(t) = \big( E_i(t),\; d_{i,\text{sink}},\; q_i(t),\; g_i(t),\; S_i(t) \big),$$
where:
$E_i(t)$: residual energy of node $i$, discretized into $n_E$ bins (e.g., 10 uniform bins between 0 and $E_0$).
$d_{i,\text{sink}}$: distance to the sink, discretized into $n_d$ concentric rings.
$q_i(t)$: queue congestion index, discretized into $n_q$ levels.
$g_i(t)$: path quality score (delay/PDR composite), mapped to $n_g$ bins using quantiles or thresholds.
$S_i(t)$: estimated end-to-end security probability from $i$ to the sink, discretized into $n_S$ intervals in $[0, 1]$.
Thus, the effective state space is finite with cardinality $|\mathcal{S}| = n_E\, n_d\, n_q\, n_g\, n_S$.
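As an illustration of the discretization, the sketch below maps the five continuous features to bin indices; the function name, bin counts, and range bounds are assumptions standing in for the configuration used in our experiments.

```python
import numpy as np

def discretize_state(E, E0, d_sink, d_max, queue_len, q_max, path_q, sec_prob,
                     n_e=10, n_d=5, n_q=3, n_g=4, n_s=5):
    """Map continuous node features to a tuple of bin indices (illustrative bins)."""
    def bin_of(value, lo, hi, n_bins):
        frac = np.clip((value - lo) / (hi - lo + 1e-12), 0.0, 1.0)
        return min(int(frac * n_bins), n_bins - 1)

    return (
        bin_of(E, 0.0, E0, n_e),           # residual energy bin
        bin_of(d_sink, 0.0, d_max, n_d),   # concentric ring index
        bin_of(queue_len, 0, q_max, n_q),  # congestion level
        bin_of(path_q, 0.0, 1.0, n_g),     # path-quality score bin
        bin_of(sec_prob, 0.0, 1.0, n_s),   # end-to-end security bin
    )

# Example: a node at half energy, mid-distance to the sink, with a light queue.
s = discretize_state(E=0.25, E0=0.5, d_sink=40.0, d_max=100.0,
                     queue_len=2, q_max=20, path_q=0.8, sec_prob=0.9)
```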
Action space. At each step, node $i$ selects one neighbor as the next-hop parent:
$$a_i(t) \in \mathcal{A}_i(t) = \big\{ j \in \mathcal{N}_i(t) : E_j(t) > E_{\min} \big\}.$$
Here, $\mathcal{N}_i(t)$ is the neighbor set from Step 2 and $E_{\min}$ is the minimum energy threshold for participation.
Reward function. The reward encourages low-energy, low-delay, and secure routing. To avoid mixing different physical units, each term is normalized:
$$r_i(t) = -\, w_1 \frac{E^{\text{TX}}_i(t)}{E_{\max}} \;-\; w_2 \frac{D_i(t)}{D_{\max}} \;+\; w_3 \frac{S_i(t)}{S_{\text{th}}},$$
where:
$E^{\text{TX}}_i(t)$: transmit energy consumed by node $i$ in epoch $t$, normalized by a maximum reference $E_{\max}$.
$D_i(t)$: end-to-end delay from node $i$ to the sink in epoch $t$, normalized by a maximum tolerable delay $D_{\max}$.
$S_i(t)$: path-level security probability, computed as in Step 4, normalized relative to the threshold $S_{\text{th}}$.
The weights $w_1, w_2, w_3 \ge 0$ satisfy $w_1 + w_2 + w_3 = 1$, balancing the trade-offs among energy, delay, and security.
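A minimal sketch of the normalized reward follows; the weights, normalization constants, and sign convention are illustrative assumptions rather than the tuned values of this work.

```python
def reward(e_tx_used, delay, sec_prob,
           e_max=0.01, d_max=0.5, s_th=0.7,
           w1=0.4, w2=0.3, w3=0.3):
    """Normalized reward: penalize energy and delay, reward path security.
    All constants are placeholders; the weights satisfy w1 + w2 + w3 = 1."""
    energy_term = min(e_tx_used / e_max, 1.0)
    delay_term = min(delay / d_max, 1.0)
    security_term = min(sec_prob / s_th, 1.0)
    return -w1 * energy_term - w2 * delay_term + w3 * security_term
```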
        
Algorithm 6. Neighbor discovery and metric initialization (per period of length $T_d$).
Decision policy. Each agent follows an $\varepsilon$-greedy policy over its Q-table:
$$a_i(t) = \begin{cases} \arg\max_{a \in \mathcal{A}_i(t)} Q_i\big(s_i(t), a\big), & \text{with probability } 1 - \varepsilon_t, \\ \text{a uniformly random } a \in \mathcal{A}_i(t), & \text{with probability } \varepsilon_t. \end{cases}$$
The exploration rate $\varepsilon_t$ decays over time, ensuring sufficient exploration initially while converging toward exploitation of high-value routes.
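The decision rule is a standard $\varepsilon$-greedy choice over the admissible neighbors; a short sketch, with illustrative data structures:

```python
import random

def epsilon_greedy(q_row, admissible_actions, epsilon):
    """q_row: dict mapping neighbor id -> Q-value for the current state."""
    if not admissible_actions:
        return None                                            # no admissible parent this epoch
    if random.random() < epsilon:
        return random.choice(admissible_actions)               # explore
    return max(admissible_actions, key=lambda a: q_row.get(a, 0.0))  # exploit
```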
 Step 4: Security Evaluation and Path Trustworthiness Assessment.
To ensure that the paths chosen in Step 3 remain resilient against misbehavior and channel anomalies, each node maintains a dynamic trust score per link $(i, j)$.
Hybrid link trust. For each neighbor $j \in \mathcal{N}_i(t)$, node $i$ computes a hybrid link trust:
$$T_{ij}(t) = \theta\, T^{\text{hist}}_{ij}(t) + (1 - \theta)\, T^{\text{beh}}_{ij}(t),$$
where:
$T^{\text{hist}}_{ij}(t)$ is the historical trust, estimated from the smoothed link reliability $\hat{r}_{ij}(t)$ (Step 2).
$T^{\text{beh}}_{ij}(t)$ is the behavioral trust, adapted from the anomaly detector outputs $F_{ij}(t)$.
$\theta \in [0, 1]$ balances long-term reliability vs. short-term anomaly evidence.
Behavioral trust update. At each discovery period, behavioral trust is updated as:
$$T^{\text{beh}}_{ij}(t) = \begin{cases} \max\big\{0,\; T^{\text{beh}}_{ij}(t-1) - \delta_p\big\}, & F_{ij}(t) = 1, \\ \min\big\{1,\; T^{\text{beh}}_{ij}(t-1) + \delta_r\big\}, & F_{ij}(t) = 0, \end{cases}$$
with $\delta_p$ the penalty decrement and $\delta_r$ the recovery increment.
Path-level security aggregation. To avoid underflow from multiplicative products, path security is aggregated in the log domain:
$$S(p) = \exp\!\left( \frac{1}{|p|} \sum_{(i, j) \in p} \ln T_{ij}(t) \right).$$
This formulation yields the geometric mean of the link trust scores, ensuring longer paths are not unfairly penalized while still reflecting weak links.
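The trust primitives of this step can be sketched as follows, with $\theta$, $\delta_p$, and $\delta_r$ as illustrative placeholders; the path score is the geometric mean of link trusts computed in the log domain.

```python
import math

def behavioral_update(t_beh, anomaly_flag, delta_p=0.2, delta_r=0.05):
    """Penalize on an anomaly flag, recover slowly otherwise; keep the score in [0, 1]."""
    if anomaly_flag:
        return max(0.0, t_beh - delta_p)
    return min(1.0, t_beh + delta_r)

def hybrid_trust(r_hat, t_beh, theta=0.5):
    """Blend historical reliability (EWMA r_hat) with behavioral trust."""
    return theta * r_hat + (1.0 - theta) * t_beh

def path_security(link_trusts, floor=1e-6):
    """Geometric mean of link trusts, aggregated in the log domain to avoid underflow."""
    logs = [math.log(max(t, floor)) for t in link_trusts]
    return math.exp(sum(logs) / len(logs))

# Example: a 3-hop path with one weak link still reflects the weak hop.
print(round(path_security([0.95, 0.9, 0.4]), 3))
```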
Constraint vs. reward. To avoid double counting, we enforce a constraint-only role for security: a candidate path $p$ is admissible if
$$S(p) \ge S_{\text{th}};$$
otherwise, it is excluded from the action set $\mathcal{A}_i(t)$ in Step 3. This ensures that the reward need not include a separate security term, keeping objectives decoupled.
Parameter ranges. Typical operating ranges are chosen for $\theta$ (balance), $\delta_p$ (penalty), $\delta_r$ (recovery), and $S_{\text{th}}$ (acceptance threshold); these can be tuned in the simulation setup. Algorithm 7 describes the process of security evaluation and trust updating.
Step 5: Local Q-Learning Model Training for Nodes.
Each node $i$ trains its own Q-learning agent over the discretized state/action space defined in Step 3, subject to the admissibility constraints from Step 4. Learning proceeds in episodes, where each episode corresponds to a sequence of packet transmissions and acknowledgments until either the sink is reached or a timeout occurs.
Q-learning update. At each decision epoch $t$, node $i$ observes its current state $s_i(t)$, selects an action $a_i(t) \in \mathcal{A}_i(t)$, obtains a normalized reward $r_i(t)$ (Step 3), and observes the next state $s_i(t+1)$. Its Q-table is updated as:
$$Q_i\big(s_i(t), a_i(t)\big) \leftarrow Q_i\big(s_i(t), a_i(t)\big) + \alpha_t \Big[ r_i(t) + \gamma \max_{a' \in \mathcal{A}_i(t+1)} Q_i\big(s_i(t+1), a'\big) - Q_i\big(s_i(t), a_i(t)\big) \Big],$$
where:
$\alpha_t$ is the learning rate at epoch $t$, scheduled to decay over time (e.g., $\alpha_t = \alpha_0 / (1 + \kappa t)$ with $\alpha_0 \in (0, 1]$ and $\kappa > 0$).
$\gamma \in (0, 1)$ is the discount factor.
$r_i(t)$ reflects normalized energy, delay, and path security feasibility (from Steps 1–4).
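A compact sketch of the tabular update with a decaying learning rate; the discount factor and schedule constants are illustrative assumptions.

```python
from collections import defaultdict

class QAgent:
    """Per-node tabular Q-learning agent over discretized states and neighbor actions."""

    def __init__(self, gamma=0.9, alpha0=0.5, kappa=0.01):
        self.Q = defaultdict(float)        # (state, action) -> value, zero by default
        self.gamma, self.alpha0, self.kappa = gamma, alpha0, kappa
        self.t = 0

    def update(self, s, a, r, s_next, admissible_next):
        """One temporal-difference update for transition (s, a, r, s_next)."""
        self.t += 1
        alpha = self.alpha0 / (1.0 + self.kappa * self.t)   # decaying learning rate
        best_next = max((self.Q[(s_next, a2)] for a2 in admissible_next), default=0.0)
        td_error = r + self.gamma * best_next - self.Q[(s, a)]
        self.Q[(s, a)] += alpha * td_error
```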
Algorithm 7. Security evaluation and trust update.
Exploration-exploitation policy. Node $i$ selects actions using an $\varepsilon$-greedy rule:
$$a_i(t) = \begin{cases} \arg\max_{a \in \mathcal{A}_i(t)} Q_i\big(s_i(t), a\big), & \text{with probability } 1 - \varepsilon_e, \\ \text{a uniformly random } a \in \mathcal{A}_i(t), & \text{with probability } \varepsilon_e, \end{cases}$$
where $\varepsilon_e$ decays over episodes, e.g., $\varepsilon_e = \max\{\varepsilon_{\min},\; \varepsilon_0\, \zeta^{\,e}\}$, with $\varepsilon_0 \in (0, 1]$, $\varepsilon_{\min} \ge 0$, and decay rate $\zeta \in (0, 1)$.
Stopping criterion and practical convergence. Unlike infinite-horizon Markov Decision Processes (MDPs), WSN agents operate under finite energy budgets and time-varying topologies. Thus, we adopt empirical convergence: training stops after $E^{\text{ep}}_{\max}$ episodes or when the Q-values stabilize, i.e., $\max_{s, a} \big| Q_i^{(e)}(s, a) - Q_i^{(e-1)}(s, a) \big| < \epsilon_Q$ for $M$ consecutive episodes, with $\epsilon_Q$ a small tolerance. This practical stopping rule avoids reliance on ergodicity assumptions that are unrealistic in battery-limited WSNs.
Q-table structure and memory footprint. Each Q-table has dimension $|\mathcal{S}| \times |\mathcal{A}_i|$, where $|\mathcal{S}|$ is the product of the discretization bin counts from Step 3 and $|\mathcal{A}_i|$ is the number of admissible neighbors. With each Q-value stored as a 4-byte float, the per-node memory requirement for modest bin counts and typical neighborhood sizes remains well within the capacity of constrained sensor platforms with ≥128 kB RAM.
Step 6: Adaptive Federated Aggregation of Q-Learning Models.
To align local policies across the WSN without transmitting raw data, we adopt FRL. Each node $i$ periodically uploads a compressed summary of its Q-table after local training, and the sink aggregates these contributions into a global Q-model. This reduces energy overhead compared to centralized training, while ensuring robustness across heterogeneous nodes.
Aggregation model. After local training episode $\tau$, node $i$ has a Q-table $Q_i^{(\tau)}$. The sink aggregates them as:
$$Q_{\text{glob}}^{(\tau)}(s, a) = \sum_{i=1}^{N} w_i^{(\tau)}\, Q_i^{(\tau)}(s, a),$$
where $w_i^{(\tau)} \ge 0$ and $\sum_{i=1}^{N} w_i^{(\tau)} = 1$. Thus, nodes with more reliable contributions influence the global model more.
Adaptive weighting. We define $w_i^{(\tau)}$ based on both residual energy and local performance:
$$w_i^{(\tau)} = \frac{\beta_1\, \bar{E}_i(\tau) + \beta_2\, \bar{G}_i(\tau)}{\sum_{j=1}^{N} \big( \beta_1\, \bar{E}_j(\tau) + \beta_2\, \bar{G}_j(\tau) \big)},$$
where:
$\bar{E}_i(\tau)$: normalized residual energy of node $i$, consistent with the radio/battery model (Step 1).
$\bar{G}_i(\tau)$: average return of node $i$'s policy, i.e., the mean episodic reward over the last $W_e$ episodes (normalized to $[0, 1]$).
$\beta_1, \beta_2 \ge 0$: trade-off weights with $\beta_1 + \beta_2 = 1$.
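The adaptive aggregation can be sketched as a weighted average of the uploaded tables; the $\beta$ coefficients and the normalization of energy and return below are assumptions, and sparse Q-tables are represented as dictionaries.

```python
import numpy as np

def aggregation_weights(energies, returns, beta1=0.6, beta2=0.4):
    """Per-node weights from normalized residual energy and mean episodic return."""
    e = np.asarray(energies, dtype=float)
    g = np.asarray(returns, dtype=float)
    e = e / (e.max() + 1e-12)                               # normalize energy to [0, 1]
    g = (g - g.min()) / (g.max() - g.min() + 1e-12)         # normalize return to [0, 1]
    score = beta1 * e + beta2 * g
    return score / score.sum()

def aggregate_q_tables(q_tables, weights):
    """Weighted average of sparse Q-tables represented as dicts {(s, a): value}."""
    global_q = {}
    for q, w in zip(q_tables, weights):
        for key, value in q.items():
            global_q[key] = global_q.get(key, 0.0) + w * value
    return global_q
```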
Temporal process. Federated exchange occurs every $T_f$ local episodes. To reduce communication cost, only nonzero entries or quantized Q-values are transmitted, incurring the energy consumption
$$E^{\text{FRL}}_i = \bar{n}_i\, E_{\text{TX}}(k_Q, d_{i,\text{sink}}),$$
where $k_Q$ is the payload size (bits), $d_{i,\text{sink}}$ is the distance to the sink, and $\bar{n}_i$ is the expected number of (re-)transmissions.
Local synchronization. Upon receiving the global Q-table, each node blends it with its local table using a Polyak update (to prevent catastrophic forgetting):
$$Q_i \leftarrow (1 - \eta)\, Q_i + \eta\, Q_{\text{glob}}, \qquad \eta \in (0, 1].$$
This ensures stability while gradually aligning local policies with the global one.
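Locally, the blend is a one-line Polyak step; the mixing rate below is an assumed constant.

```python
def polyak_blend(local_q, global_q, eta=0.3):
    """Blend the global table into the local one without overwriting it."""
    for key, g_value in global_q.items():
        local_q[key] = (1.0 - eta) * local_q.get(key, 0.0) + eta * g_value
    return local_q
```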
Stopping condition. Aggregation continues until either (i) the global Q-values stabilize ($\lVert Q_{\text{glob}}^{(\tau)} - Q_{\text{glob}}^{(\tau-1)} \rVert_\infty < \epsilon_Q$), or (ii) the number of alive nodes drops below a threshold. Algorithm 8 presents the adaptive federated aggregation of local Q-learning models.
Step 7: Optimization of Clustering and Routing Using Hunger Games Search.
We apply a binary variant of the HGS algorithm to jointly optimize CH selection and routing paths. The resulting topology minimizes energy consumption, delay, and vulnerability, while ensuring structural feasibility.
Chromosome representation. Each candidate solution is encoded as:
A binary CH vector $\mathbf{c} = (c_1, \ldots, c_N)$, where $c_i = 1$ if node $i$ is a CH and $c_i = 0$ otherwise.
A directed binary routing matrix $\mathbf{R} \in \{0, 1\}^{N \times N}$, where $R_{ij} = 1$ denotes a parent-child link $i \to j$.
Feasibility constraints. Each solution $(\mathbf{c}, \mathbf{R})$ must satisfy:
$$\sum_{j} R_{ij} = 1 \;\; \forall i \text{ with } c_i = 0, \qquad R_{ij} = 1 \Rightarrow d_{ij} \le R_c, \qquad \text{every route terminates at the sink}.$$
Infeasible candidates are repaired (reassigning orphan nodes) or penalized in the fitness score.
        
Algorithm 8. Adaptive federated aggregation of local Q-learning models.
Fitness function. The objective is to minimize a weighted combination of normalized network-level metrics:
$$F(\mathbf{c}, \mathbf{R}) = \phi_1\, \widehat{E}_{\text{tot}} + \phi_2\, \widehat{D}_{\text{avg}} + \phi_3\, \big(1 - \widehat{S}_{\min}\big), \qquad \phi_1 + \phi_2 + \phi_3 = 1,$$
where $\widehat{E}_{\text{tot}}$ is the normalized total energy consumption of the candidate topology, $\widehat{D}_{\text{avg}}$ is the normalized average end-to-end delay, and $\widehat{S}_{\min}$ is the minimum path security (Step 4) over all node-to-sink routes.
HGS update mechanism (binary variant). In the continuous HGS update, candidate solutions evolve via hunger-weighted moves toward the current best position:
$$X_i(t+1) = \begin{cases} X_i(t)\,\big(1 + \mathrm{randn}\big), & r_1 < l, \\ W_1\, X_b(t) + R_h\, W_2\, \big| X_b(t) - X_i(t) \big|, & r_1 > l,\; r_2 > E_h, \\ W_1\, X_b(t) - R_h\, W_2\, \big| X_b(t) - X_i(t) \big|, & r_1 > l,\; r_2 < E_h, \end{cases}$$
where $X_b(t)$ is the best position found so far, $W_1$ and $W_2$ are hunger weights, $R_h$ is a shrinking range factor, $E_h$ is a variation-control parameter, $l$ is a constant, and $r_1, r_2$ are uniform random numbers in $[0, 1]$. To preserve binary encoding, we apply a sigmoid mapping and Bernoulli sampling:
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad c_{i,n} \sim \mathrm{Bernoulli}\big(\sigma(x_{i,n})\big),$$
then sample each bit of the candidate vector accordingly. This ensures binary Cluster Head assignments.
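The binarization step can be sketched as follows; the continuous positions are whatever the HGS update produces, and only the sigmoid/Bernoulli mapping to bits is shown.

```python
import numpy as np

def binarize(x, rng=None):
    """Map continuous HGS positions to binary CH bits via sigmoid + Bernoulli sampling."""
    rng = rng or np.random.default_rng()
    probs = 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))   # sigmoid mapping
    return (rng.random(probs.shape) < probs).astype(int)        # Bernoulli sampling

# Example: five continuous positions -> one binary CH assignment vector.
print(binarize([2.0, -1.5, 0.0, 3.2, -0.2]))
```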
The hunger factor is computed from the fitness gap to the current best solution:
$$h_i \propto \frac{F_i - F_{\text{best}}}{F_{\text{worst}} - F_{\text{best}} + \epsilon},$$
where $h_i$ reflects the fitness-induced hunger of solution $i$; the hunger weights $W_1$ and $W_2$ are derived from the normalized hunger values.
Step 8: Integrating HGS Output with Reinforcement Learning for Faster Convergence and Structured Stability.
The binary HGS optimization yields an initial Cluster Head assignment $\mathbf{c}^\star$ and a feasible routing matrix $\mathbf{R}^\star$. These outputs are used to warm-start the Q-learning agents of Step 5, ensuring that training begins from a near-optimal topology rather than from a random policy. This accelerates convergence and improves stability in the early phases of federated aggregation.
Inputs. The initialization procedure uses:
$\mathbf{c}^\star$: optimized CH selection vector.
$\mathbf{R}^\star$: optimized directed routing matrix.
$E_i(t)$: residual energy of all nodes (Step 1).
$S_i$: per-node security index (Step 4).
Q-table initialization. For each node $i$, the Q-table entries corresponding to admissible actions $a \in \mathcal{A}_i$ are initialized as:
$$Q_i(s, a) = \begin{cases} q_{\text{hi}}, & \text{if } R^\star_{ia} = 1 \text{ (the HGS-selected parent)}, \\ q_{\text{lo}}, & \text{otherwise}, \end{cases}$$
where $q_{\text{hi}} > q_{\text{lo}} \ge 0$ are bounded constants. This ensures that routes favored by HGS start with higher expected utility.
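A sketch of the warm start, where the function name is illustrative, `q_hi` and `q_lo` are the bounded constants of the text (placeholder values here), and the HGS-selected parent is read off the routing matrix $\mathbf{R}^\star$.

```python
def warm_start_q(states, neighbors, parent_of, q_hi=1.0, q_lo=0.1):
    """Initialize Q(s, a) higher for the HGS-selected parent of this node.
    parent_of: next-hop chosen for this node by the HGS routing matrix."""
    Q = {}
    for s in states:
        for a in neighbors:
            Q[(s, a)] = q_hi if a == parent_of else q_lo
    return Q

# Example: 3 discretized states, neighbors {4, 7, 9}, HGS routes through node 7.
Q0 = warm_start_q(states=[0, 1, 2], neighbors=[4, 7, 9], parent_of=7)
```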
Policy initialization. The initial decision policy is biased toward the HGS solution:
$$\pi_i^{(0)}(a \mid s) = \begin{cases} 1 - \varepsilon_0, & \text{if } R^\star_{ia} = 1, \\ \dfrac{\varepsilon_0}{|\mathcal{A}_i| - 1}, & \text{otherwise}, \end{cases}$$
where $\varepsilon_0 > 0$ allows limited exploration from the start. This avoids purely deterministic choices and ensures that suboptimal HGS assignments can still be corrected.
Effect on convergence. With HGS-guided initialization, the number of training episodes needed to reach a near-optimal policy is reduced. There exists an episode index $e_0$ such that:
$$\big\lVert Q_i^{\text{HGS}, (e)} - Q_i^\star \big\rVert_\infty \le \epsilon \;\; \text{for all } e \ge e_0, \qquad \text{while} \qquad \big\lVert Q_i^{\text{rand}, (e_0)} - Q_i^\star \big\rVert_\infty > \epsilon,$$
meaning that the hybrid HGS-RL process attains an $\epsilon$-close approximation of the optimal Q-function faster than a random-start Q-learning agent. This warm-start effect improves energy efficiency by reducing the number of early, exploratory transmissions that would otherwise consume scarce battery resources.
 Step 9: Adaptive Re-Aggregation of Learning Policies after HGS-RL Convergence.
After nodes complete their local HGS-guided Q-learning episodes, a second federated aggregation is performed to refine and synchronize policies. This re-aggregation stage ensures that energy-depleted nodes do not dominate decisions, while reinforcing secure and reliable routes across the network.
Inputs. Each node $i$ provides:
Its locally updated Q-table $Q_i$, which reflects both RL training and HGS initialization.
Residual energy $E_i(t)$.
Reliability score $\rho_i$, defined as the Packet Delivery Ratio (PDR) over the last $W_p$ transmissions.
Security index $S_i$.
Adaptive weighting. The global aggregation weights are computed as:
$$w_i = \frac{\mu_1\, \bar{E}_i + \mu_2\, \rho_i + \mu_3\, S_i}{\sum_{j=1}^{N} \big( \mu_1\, \bar{E}_j + \mu_2\, \rho_j + \mu_3\, S_j \big)},$$
with $\mu_1 + \mu_2 + \mu_3 = 1$. This ensures balanced contributions from energy-rich, reliable, and secure nodes.
Global aggregation. The sink computes the new federated Q-function:
$$Q_{\text{glob}}(s, a) = \sum_{i=1}^{N} w_i\, Q_i(s, a).$$
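The re-aggregation differs from Step 6 mainly in the weight definition, which now also folds in reliability and security; a short sketch with assumed $\mu$ coefficients (the weighted table average itself follows Step 6).

```python
import numpy as np

def reaggregation_weights(energies, pdr, security, mu1=0.4, mu2=0.3, mu3=0.3):
    """Weights from normalized residual energy, packet delivery ratio, and security index."""
    e = np.asarray(energies, dtype=float)
    e = e / (e.max() + 1e-12)                                  # normalize energy to [0, 1]
    score = mu1 * e + mu2 * np.asarray(pdr, float) + mu3 * np.asarray(security, float)
    return score / score.sum()
```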
Local synchronization. Instead of overwriting local tables, each node blends the global update into its own Q-table:
$$Q_i \leftarrow (1 - \eta)\, Q_i + \eta\, Q_{\text{glob}}.$$
This prevents catastrophic forgetting and ensures smoother convergence. Algorithm 9 presents the adaptive re-aggregation of HGS-RL Q-tables.
        
Algorithm 9. Adaptive re-aggregation of HGS-RL Q-tables.