A Lightweight Double-Deep Q-Network for Energy Efficiency Optimization of Industrial IoT Devices in Thermal Power Plants

Gao, Shuang; Zou, Yuntao; Feng, Li

doi:10.3390/electronics14132569

Open AccessArticle

A Lightweight Double-Deep Q-Network for Energy Efficiency Optimization of Industrial IoT Devices in Thermal Power Plants

by

Shuang Gao

¹,

Yuntao Zou

²

and

Li Feng

^3,*

¹

School of Computer Science and Engineering, Macau University of Science and Technology, Macau 999078, China

²

School of Energy and Power Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

³

Macau University of Science and Technology Zhuhai MUST Science and Technology Research Institute, Zhuhai 519031, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(13), 2569; https://doi.org/10.3390/electronics14132569

Submission received: 22 May 2025 / Revised: 17 June 2025 / Accepted: 22 June 2025 / Published: 25 June 2025

(This article belongs to the Topic Advanced Propagation Channel Estimation Techniques for Sixth-Generation (6G) Wireless Communications)

Download

Browse Figures

Versions Notes

Abstract

Industrial Internet of Things (IIoT) deployments in thermal power plants face significant energy efficiency challenges due to harsh operating conditions and device resource constraints. This paper presents gradient memory double-deep Q-network (GM-DDQN), a lightweight reinforcement learning approach for energy optimization on resource-constrained IIoT devices. At its core, GM-DDQN introduces the gradient memory mechanism, a novel memory-efficient alternative to experience replay. This core innovation, combined with a simplified neural network architecture and efficient parameter quantization, collectively reduces memory requirements by 99% and computation time by 85–90% compared to standard methods. Experimental evaluations across three realistic simulated thermal power plant scenarios demonstrate that GM-DDQN improves energy efficiency by 42% compared to fixed policies and 27% compared to threshold-based approaches, extending battery lifetime from 8–9 months to 14–15 months while maintaining 96–97% PSR. The method enables sophisticated reinforcement learning directly on IIoT edge devices without requiring cloud connectivity, reducing maintenance costs and improving monitoring reliability in industrial environments.

Keywords:

industrial internet of things; energy efficiency; reinforcement learning; resource-constrained devices; thermal power plants; edge intelligence; deep Q-network

1. Introduction

The rapid evolution of the Industrial Internet of Things (IIoT) is revolutionizing modern industrial operations, especially within critical infrastructures like thermal power plants [1,2,3,4,5,6,7,8,9]. These complex environments increasingly rely on the IIoT for continuous monitoring, predictive maintenance, and real-time operational optimization [10,11,12,13]. However, deploying wireless sensor networks (WSNs) in such demanding settings introduces significant hurdles, primarily in achieving energy efficiency without compromising reliable communication [14,15,16]. To address this critical gap, this paper introduces a novel resource-aware deep reinforcement learning (DRL) framework designed to optimize energy consumption for IIoT devices in thermal power plants.

1.1. Background

The advent of Industry 4.0 has positioned IIoT technology as a cornerstone for modern thermal power plant operations, extensively utilized for equipment condition-monitoring, predictive maintenance, and real-time control [17,18,19,20,21]. Within these critical infrastructures, the operational status of core machinery, such as boilers and turbines, directly dictates power generation efficiency, systemic safety, and economic viability. IIoT sensors, acting as crucial conduits between physical assets and digital analytical systems, are thus indispensable for the performance and reliability of the entire power generation ecosystem.

Nevertheless, sensors deployed in these extreme environments—characterized by high temperatures often ranging from 100 to 500 °C, high pressures up to 10 to 30 MPa, strong electromagnetic interference (EMI) from power equipment, and arduous installation conditions—face substantial operational challenges [22]. The energy efficiency of battery-powered sensors in such settings is a critical determinant of device lifespan, maintenance expenditure, and overall system reliability. For example, a modest 10% reduction in battery life due to inefficient energy management in a plant with several hundred sensors can escalate annual replacement costs by tens of thousands of dollars [3,23]. More critically, energy depletion leading to monitoring failures can prevent the timely detection of faults in essential equipment. Such failures can precipitate equipment shutdowns, incurring hourly losses that can range from tens to hundreds of thousands of dollars, and, in severe cases, trigger power grid instabilities with potential economic impacts in the millions [24].

Conventional energy management techniques, including fixed power allocation and rudimentary threshold-based sleep strategies, exhibit marked limitations within the dynamic and complex milieu of thermal power plants. Fixed power strategies, for instance, are inherently unable to adapt to fluctuating wireless channel conditions. This rigidity can lead to an estimated 30% increase in energy consumption due to packet retransmissions in high-interference scenarios, or, conversely, energy wastage under benign channel conditions. Similarly, static sleep strategies are ill-equipped to predict and react to sporadic critical events, such as abrupt increases in equipment vibration. Consequently, they risk missing vital data transmissions or incurring unnecessary energy expenditure, potentially up to 50% of standby power, from premature wake-ups during quiescent periods [25,26]. Such inefficiencies culminate in curtailed battery lifespans: for example, a reduction from 12 to 9 months, inflated maintenance overheads, and compromised reliability and safety of the power generation system.

Consequently, maximizing the energy efficiency of IIoT devices—while concurrently ensuring reliable data transmission and prompt responsiveness—emerges as a pivotal technical challenge for the intelligent and sustainable operation of thermal power plants. Addressing this challenge transcends mere engineering concerns of extending device operational lifespans; it is intrinsically linked to enhancing the resilience and economic performance of the entire power system, thereby holding substantial practical and theoretical importance.

1.2. Key Challenges in Energy Efficiency Optimization

Optimizing the energy efficiency of IIoT deployments in thermal power plants is confronted by several formidable challenges:

Environmental Complexity and Dynamics: Thermal power plant environments are inherently non-stationary. IIoT devices must contend with pronounced wireless channel variations stemming from EMI generated by power equipment, signal reflections from metallic structures, and fluctuating operational loads that might vary, for instance from 50% to 100%. These factors can cause the signal-to-noise ratio (SNR) to plummet from a stable 25 dB to as low as 5 dB within brief intervals. Concurrently, critical parameters such as boiler vibrations or turbine speeds can exhibit erratic changes: for example, the vibrations might escalate from a baseline 20 Hz to 60 Hz during an anomaly, demanding real-time adaptive capabilities from sensor nodes. Such pronounced dynamism renders static optimization strategies largely ineffective in maintaining peak performance [27,28,29].
Multi-Objective Trade-offs: Critical monitoring applications in thermal power plants necessitate a delicate balance between energy efficiency, system reliability exemplified by data delivery success rates, and responsiveness characterized by low latency. An overemphasis on energy conservation can inadvertently degrade data transmission success rates or protract response times. In high-stakes scenarios, such as the early detection of equipment faults, these compromises can lead to severe consequences; for instance, the delayed detection of a turbine bearing failure could result in hourly losses amounting to hundreds of thousands of dollars. Optimization algorithms must therefore navigate these competing objectives to achieve a holistically optimal operational balance [30,31,32,33].
Resource-Constrained Edge Devices: IIoT sensors in power plants are typically embedded systems with stringent resource limitations, often featuring microcontrollers (MCUs) like the STM32 series with modest processing power around 48 MHz and limited memory of approximately 32 KB of RAM and 256 KB of flash. In stark contrast, conventional DRL algorithms often demand considerable computational and memory resources, rendering them unsuitable for direct deployment on such edge devices. Consequently, a pivotal challenge lies in designing energy efficiency optimization algorithms that are not only effective but also sufficiently lightweight to operate within these constraints without significant performance degradation [34,35].
Perceptual and Decisional Uncertainty: Sensor-derived perceptions of the operating environment, including channel quality and equipment vibration levels, are often subject to noise and inherent latencies. Decisions predicated on such imperfect information can, in turn, influence future state observations, creating intricate feedback loops. Furthermore, the unpredictable nature of power plant load variations and dynamic interference patterns introduces a high degree of uncertainty into the decision-making process, demanding robust algorithmic solutions [36,37].

1.3. Our Contributions

To address the aforementioned challenges in optimizing energy efficiency for IIoT devices within thermal power plants, this paper proposes a novel resource-aware double-deep Q-network (GM-DDQN) method. The primary contributions of this work are as follows:

A Novel Memory-Efficient Learning Mechanism: Gradient Memory. Our primary contribution is the proposal and formalization of the gradient memory mechanism, a conceptual alternative to experience replay designed specifically for memory-constrained edge learning. By storing and reusing a compact history of loss gradients, it fundamentally addresses the memory bottleneck of on-device DRL, paving the way for enhanced on-device intelligence in IIoT networks.
A Holistic Lightweight DRL Framework for Edge Deployment. Building upon our core innovation, we design and implement the complete GM-DDQN framework. It integrates the gradient memory mechanism with complementary techniques, including a streamlined neural network architecture, parameter quantization, and efficient parameter update strategies. This results in a system that is not only theoretically novel but also practically deployable, drastically cutting computational overhead and inference time while preserving robust learning and decision-making capabilities.
Joint Adaptive Wireless Communication and Sleep Scheduling. We introduce a novel approach that formulates transmission power control and sleep mode management as a unified joint decision-making problem. The GM-DDQN agent learns to dynamically coordinate these actions based on real-time environmental states, including SNR, equipment vibration intensity, and network traffic. This holistic management demonstrably achieves substantial energy savings, experimentally shown to be between 22% and 42%, directly enhancing device longevity.
Multi-Objective Optimization Tailored for Industrial Needs. To meet the multifaceted operational requirements of thermal power plants, we designed a novel multi-objective reward function. This function intrinsically balances three critical performance indicators: energy efficiency, data transmission reliability (success rate), and communication latency. Through carefully assigned weighting factors, the GM-DDQN agent is trained to maximize energy efficiency while stringently maintaining data success rates above 95% and communication delays below a 400 ms threshold.
Rigorous Validation in Realistic Simulated Scenarios. The efficacy of the proposed GM-DDQN framework is demonstrated through extensive and rigorous experimental evaluations. We systematically benchmark GM-DDQN against standard DRL algorithms (DDQN, DDPG, and PPO) and conventional strategies across diverse and realistic simulated thermal power plant scenarios. The results substantiate that GM-DDQN achieves performance comparable to state-of-the-art DRL methods in terms of energy efficiency and data reliability while operating with only a fraction of their computational resource footprint. This comprehensive validation underscores the practical viability of GM-DDQN for large-scale IIoT deployments.

2. Related Work

This section reviews the energy efficiency optimization of IIoT devices, applications of reinforcement learning in resource management, and lightweight deep reinforcement learning techniques for resource-constrained devices. We analyze the limitations in the existing research and explain how our proposed GM-DDQN method addresses these gaps.

2.1. Energy Efficiency Optimization in IIoT

The Industrial Internet of Things, as a core technology of Industry 4.0, is transforming traditional manufacturing. Hu et al. [6] proposed a five-layer architecture for IIoT intelligence empowering smart manufacturing, indicating that energy optimization is one of the important contributions of IIoT intelligence. In this context, energy efficiency of IIoT devices has become a core challenge affecting long-term reliable system operation.

Yadav et al. [38] reviewed various industrial energy optimization methods, including optimization algorithms based on simulated annealing, genetic algorithms, and SARSA-based sleep mode management, which effectively balance energy conservation and service quality. Hou et al. [39] developed a thermal energy harvesting WSN node that converts industrial waste heat into electrical energy using thermoelectric generators, determining a minimum sleep cycle with a 5.4% duty cycle to achieve indefinite operation. Farné et al. [40] demonstrated how to utilize industrial Ethernet infrastructure for real-time monitoring of industrial equipment efficiency without additional monitoring systems.

In wireless communication energy efficiency, Pradhan et al. [41] proposed a joint optimization method for packet length and power allocation to maximize secure energy efficiency in URLLC-oriented IIoT environments. Solati et al. [42] studied energy efficiency in UAV-assisted IIoT networks, combining beamforming and NOMA techniques to solve non-convex optimization problems. Zhou et al. [43] proposed an energy efficiency optimization framework for power line inspection using IIoUAVs in smart grids, jointly considering variables at large and small time scales, designing a two-stage algorithm combining dynamic programming, auction theory, and matching theory. Jiang et al. [44] proposed an energy-efficient networking approach for IIoT cloud services, jointly optimizing distributed data centers and cloud network energy efficiency with a heuristic algorithm combining a niche genetic algorithm and random depth-first search.

In energy system optimization, Mamaghani et al. [45] conducted multi-objective optimization of HT-PEM fuel cell-based micro combined heat and power systems, analyzing the trade-offs between different performance indicators under full- and partial-load conditions. Ma et al. [46] studied multi-objective performance optimization for gas turbine part-load operation, and Kumar [47] conducted a comprehensive 4-E analysis of thermal power plants, identifying key energy loss locations. Qu et al. [48] proposed a dual-region topology optimization method to improve thermal cycle efficiency in nuclear power plants, achieving approximately a 240% efficiency improvement. Liu et al. [49] conducted thermal economy analysis and multi-objective optimization of a CO₂ transcritical pumped thermal electricity storage system, increasing the energy utilization rate to 97.66%. Cacciali et al. [50] investigated thermal energy integration and optimization in compressed air energy storage systems, proposing solutions for solid and liquid TES systems.

While these traditional optimization methods are effective under specific conditions, they often rely on predefined rules or simplified models and struggle to adapt to dynamic changes in complex industrial environments like thermal power plants, prompting researchers to explore more advanced methods.

2.2. Reinforcement Learning-Based Energy and Resource Management

Deep reinforcement learning, with its ability to learn optimal strategies through interaction without requiring precise environment models, provides a powerful tool for optimizing complex dynamic systems. Chen et al. [36] comprehensively reviewed DRL applications in IoT, highlighting its potential in addressing communication, computation, caching, and control challenges, especially emphasizing its value in smart grid energy management.

Zhang et al. [51] proposed an intelligent resource adaptation scheme for diversified service requirements in IIoT, designing a KODDQN algorithm to solve the optimization problem of maximizing long-term key quality indicators (KQIs). Dridi et al. [52] proposed a DRL method based on an RNN architecture for microgrid distributed energy management, demonstrating superior performance and learning speed compared to traditional methods. Dolatabadi et al. [53] proposed a completely model-free DDPG framework for optimizing PV-integrated energy hub scheduling, analyzing DQN’s limitations in handling continuous action spaces and DDPG’s advantages.

Furthermore, deep reinforcement learning and related machine learning techniques have been widely applied to solve diverse resource and energy optimization problems, such as in mobile edge computing for joint task offloading and resource allocation with energy harvesting [54], in virtualized radio access networks for optimizing network function placement using graph neural networks [55], and even in logistics for enhancing time and energy efficiency in machine learning-driven truck–drone collaborative deliveries [56]. Similarly, deep Q-learning has been utilized to develop efficient UAV detouring algorithms for energy-minimized data collection and sensor recharging within constrained wireless sensor network environments [57].

While these studies demonstrate DRL’s potential in energy and resource management, their high computational and memory requirements make direct deployment on resource-constrained IIoT devices challenging, highlighting the necessity of developing lightweight DRL methods.

2.3. Lightweight DRL Methods and Edge Intelligence

Traditional DRL methods’ high resource demands fundamentally conflict with IIoT devices’ resource constraints. Chen et al. [36] emphasized the concept of edge intelligence—deploying AI at network edges—highlighting benefits including reduced latency, enhanced security, and improved energy efficiency while acknowledging the significant challenges in implementing complex DRL algorithms on resource-constrained edge devices.

Singh et al. [58] proposed a compact IIoT system for remote monitoring of industrial facilities, integrating real-time data analysis, predictive maintenance, and energy optimization algorithms, emphasizing the importance of local data processing in edge computing environments. Dridi et al. [52] also identified computational and data requirements of reinforcement learning, suggesting further simplification and optimization for resource-constrained devices.

The existing lightweight DRL research primarily focuses on model compression, knowledge distillation, and efficient network architectures but lacks optimization specifically for industrial scenarios like thermal power plants, particularly in multi-objective optimization problems involving transmission power adjustment and sleep mode management.

2.4. Research Gaps and Contributions

Based on a comprehensive analysis of the existing research, we identify several key research gaps: First, the existing IIoT energy efficiency optimization methods often rely on predefined rules or computationally intensive algorithms, struggling to balance adaptability and resource constraints. The traditional methods by Yadav et al. [38] lack environmental adaptability; Ma et al. [46] focused on industrial equipment rather than monitoring devices; Hou et al. [39]’s fixed sleep cycle strategy lacks adaptability; Farné et al. [40] primarily optimized monitored equipment rather than monitoring devices themselves; and the methods by Pradhan et al. [41] and Zhou et al. [43] have high computational complexity, making implementation on resource-constrained devices difficult. Second, although Dridi et al. [52], Dolatabadi et al. [53], and Zhang et al. [51] demonstrated DRL’s potential in energy management, their high computational requirements limit application on resource-constrained IIoT devices. Singh et al. [58], while considering resource constraints, lacked adaptive learning mechanisms for dynamic environments. Third, the existing lightweight DRL methods lack optimization specifically for industrial scenarios like thermal power plants, unable to effectively address their complex environmental characteristics and multi-objective optimization needs.

To address these gaps, our proposed GM-DDQN method offers the following innovative contributions:

Lightweight Deep Reinforcement Learning Framework: We propose a GM-DDQN approach that is suitable for embedded systems, significantly reducing the resource requirements while maintaining decision-making performance through a simplified network structure, a novel gradient memory mechanism that replaces experience replay, and streamlined update strategies.

Adaptive Wireless Communication Strategy: We innovatively treat transmission power adjustment and sleep mode management as a joint decision problem, dynamically coordinating both based on the environmental states—increasing power and maintaining activity during high-interference and high-load periods while reducing power and entering deep sleep during stable low-load periods.

Multi-Objective Optimization for Industrial IoT: We design a multi-objective reward function specifically for thermal power plants, organically combining energy efficiency, data success rate, and communication delay, allowing weight adjustments to maximize energy efficiency while ensuring high reliability and low latency.

The experimental results show that GM-DDQN performs comparably to more complex algorithms while consuming only about one-twentieth of the computational resources, providing a practical solution for the large-scale deployment of IIoT systems in thermal power plants, demonstrating significant engineering application value.

3. Problem Modeling and Environment Definition

This section formalizes the energy efficiency optimization problem for IIoT devices operating within thermal power plants. We first detail the system model and delineate the specific environmental characteristics of these industrial settings. Subsequently, we formulate the optimization task as a Markov Decision Process (MDP), providing a framework for applying reinforcement learning.

3.1. System Model

We consider an IIoT network deployed within a thermal power plant, comprising multiple battery-powered sensor nodes. These nodes are responsible for collecting environmental and equipment data and transmitting it to a central gateway for further processing or decision-making. The overall system architecture, illustrating the interaction between sensor nodes, the IIoT environment, and the decision optimization agent, is depicted in Figure 1.

Each sensor node

i \in N = {1, 2, . . ., N}

is defined by several key attributes:

Maximum battery capacity ( $B_{i}^{m a x}$ in mAh) and current battery level ( $B_{i} (t)$ ).
A discrete set of available transmission power levels ( $P_{i} = {p_{i}^{1}, . . ., p_{i}^{K}}$ in dBm).
A discrete set of available sleep modes ( $S_{i} = {s_{i}^{1}, . . ., s_{i}^{L}}$ ), each mode $s_{i}^{l}$ having a specific power consumption rate ( $e_{i}^{l}$ in mW) and wake-up delay ( $d_{i}^{l}$ in ms).
An average data generation rate ( $λ_{i}$ in packets/s) and processing capability ( $C_{i}$ in MIPS).

The wireless communication channel between node i and the gateway is influenced by time-varying path loss

P L_{i} (t)

, constant channel noise power

N_{0}

, and interference

I_{i} (t)

. The signal-to-noise-plus-interference ratio (SINR),

γ_{i} (t)

, is crucial for communication quality and is given by

γ_{i} (t) = \frac{p_{i} (t) \cdot P L_{i} (t)}{N_{0} + I_{i} (t)}

(1)

where

p_{i} (t) \in P_{i}

is the selected transmission power. A detailed temperature-influenced model for

P L_{i} (t)

is presented in Section 3.2.3.

The Packet Success Rate (PSR),

P S R_{i} (t)

, representing the probability of successful data transmission, is modeled as a function of

γ_{i} (t)

:

P S R_{i} (t) = \{\begin{matrix} 1 - exp (- k \cdot (γ_{i} (t) - γ_{t h})), & if γ_{i} (t) > γ_{t h} \\ 0, & otherwise \end{matrix}

(2)

Here,

γ_{t h}

is the minimum SINR for successful reception, and k is a constant related to the modulation and coding scheme.

3.2. Thermal Power Plant Environment Characteristics

The IIoT operational environment in thermal power plants is characterized by harsh and dynamic conditions impacting sensor performance and energy consumption.

3.2.1. Electromagnetic Interference Patterns

Electromagnetic interference (EMI),

I_{i} (t)

, is a significant factor composed of three distinct components: a stable background level, predictable cyclic noise, and sporadic bursts. This can be modeled as

I_{i} (t) = I_{base} + I_{var} (t) + I_{burst} (t)

(3)

Here,

I_{base}

represents the persistent baseline EMI from distant power lines and general industrial noise.

I_{burst} (t)

captures sporadic high-intensity interference from transient events like large motor startups or circuit breaker switching.

The term

I_{var} (t)

models the predictable time-varying interference linked to the cyclical operation of heavy machinery, such as turbines and generators. It can be represented as a superposition of sinusoidal waves:

I_{var} (t) = A \cdot sin (2 π f_{1} t) + B \cdot sin (2 π f_{2} t + ϕ)

(4)

In this model, amplitudes A and B and frequencies

f_{1}

and

f_{2}

correspond to the interference strength and fundamental operating frequencies of different major equipment (e.g., a 50/60 Hz generator and a variable-frequency drive motor), while

ϕ

represents a phase offset.

3.2.2. Equipment Vibration and Monitoring Needs

Equipment vibration intensity,

V_{i} (t)

, is a key indicator of machinery health. We model it as the sum of a normal baseline, predictable fluctuations, and potential anomaly signals:

V_{i} (t) = V_{normal} + V_{fluctuation} (t) + V_{anomaly} (t)

(5)

Here,

V_{normal}

represents the baseline vibration of a machine operating in a healthy steady state.

V_{fluctuation} (t)

models the regular load-dependent variations in vibration that occur as the power plant’s output changes throughout the day. Critically,

V_{anomaly} (t)

signifies sharp, irregular, and significant deviations from the norm, which could indicate an impending equipment fault, such as bearing wear or blade imbalance, requiring immediate attention.

3.2.3. Temperature and Path Loss Variations

Ambient temperature

T_{i} (t)

at a sensor’s location is dynamic, affected by machinery and diurnal cycles:

T_{i} (t) = T_{b a s e} + Δ T_{o p e r a t i o n} (t) + Δ T_{d a i l y} (t)

(6)

Temperature influences wireless path loss

P L_{i} (t)

, which is modeled as Equation (7):

P L_{i} (t) = P L_{0} \cdot {(\frac{d_{i}}{d_{0}})}^{- α} \cdot (1 + β \cdot (T_{i} (t) - T_{r e f}))

(7)

where

P L_{0}

is reference path loss at distance

d_{0}

,

d_{i}

is node–gateway distance,

α

is path loss exponent,

β

is temperature coefficient, and

T_{r e f}

is reference temperature.

These environmental factors (EMI, vibration, and temperature) are often interconnected, creating complex coupled effects on sensor performance and energy use. For instance, increased plant load can simultaneously elevate EMI, vibration, and local temperatures.

3.3. Energy Consumption Model

The total energy consumed by sensor node i during a time step t,

E_{i} (t)

, is the sum of:

E_{i} (t) = E_{i}^{t x} (t) + E_{i}^{r x} (t) + E_{i}^{p r o c} (t) + E_{i}^{s l e e p} (t)

(8)

representing energy for transmission, reception, processing, and sleep modes, respectively.

Transmission energy

E_{i}^{t x} (t)

for

N_{i}^{t x} (t)

packets of size L bits, at data rate

R_{i} (t)

and power

p_{i} (t)

, is

E_{i}^{t x} (t) = p_{i} (t) \cdot \frac{L \cdot N_{i}^{t x} (t)}{R_{i} (t)}

(9)

This is the energy for transmission attempts; effective energy per successfully delivered bit increases if

P S R_{i} (t) < 1

due to retransmissions.

Reception energy

E_{i}^{r x} (t)

and processing energy

E_{i}^{p r o c} (t)

(dependent on

C_{i}

and processing tasks) are considered. For modeling simplicity in this work, these are assumed to be relatively constant or subsumed into an average active-mode power consumption when not transmitting or in deep sleep, allowing focus on the dominant controllable terms

E_{i}^{t x} (t)

and

E_{i}^{s l e e p} (t)

.

Sleep mode energy

E_{i}^{s l e e p} (t)

for a chosen mode

s_{i} (t) \in S_{i}

is

E_{i}^{s l e e p} (t) = e_{i}^{s_{i} (t)} \cdot T_{i}^{s l e e p} (t)

(10)

where

e_{i}^{s_{i} (t)}

is the mode’s power rate and

T_{i}^{s l e e p} (t)

is its duration.

3.4. MDP Formulation

The energy efficiency optimization is formulated as an MDP, defined by

(S, A, P, R, γ)

, where the IIoT device (agent) learns to minimize long-term energy use while meeting performance criteria.

3.4.1. State Space ( $S$ )

The state

s_{t} \in S

at time t provides the agent with essential context for decision-making:

s_{t} = 〈 γ_{i} (t), B_{i} (t), V_{i} (t), Q_{i} (t), T_{i} (t) 〉

(11)

This vector comprises current SINR (

γ_{i} (t)

), remaining battery (

B_{i} (t)

), equipment vibration (

V_{i} (t)

), data queue length (

Q_{i} (t)

), and ambient temperature (

T_{i} (t)

). These variables are selected for their direct influence on communication, energy, and data relevance.

3.4.2. Action Space ( $A$ )

The action

a_{t} \in A

at time t is a single choice from a discrete action space. This space is constructed as the Cartesian product of the set of available transmission power levels

P_{i}

and the set of available sleep modes

S_{i}

. Each action is therefore a composite tuple that jointly determines both control variables:

a_{t} = 〈 p_{i} (t), s_{i} (t) 〉, where p_{i} (t) \in P_{i} and s_{i} (t) \in S_{i}

(12)

For instance, with the 4 power levels and 3 sleep modes defined in our experiments, the agent selects one action from a total of

| A | = 4 \times 3 = 12

possible discrete actions at each time step. This formulation transforms the multi-dimensional control problem into a unified single-dimensional decision task. Consequently, the action space remains small and computationally tractable, directly avoiding the combinatorial explosion that would arise from treating each control variable independently.

3.4.3. Transition Probability ( $P$ )

The transition probability

P (s_{t + 1} | s_{t}, a_{t})

dictates the likelihood of moving from state

s_{t}

to

s_{t + 1}

given action

a_{t}

. It is governed by environmental dynamics (Section 3.2) and action consequences. In our model-free DRL approach,

P

is learned implicitly through environmental interaction.

3.4.4. Reward Function ( $R$ )

The reward function

R (s_{t}, a_{t})

guides the agent by quantifying the desirability of actions, balancing multiple objectives:

R (s_{t}, a_{t}) = w_{1} R_{e n e r g y} + w_{2} R_{r e l i a b i l i t y} + w_{3} R_{l a t e n c y}

(13)

Components (for state

s_{t}

and action

a_{t}

):

Energy: $R_{e n e r g y} = - E_{i} (t) / E_{m a x}$ , penalizing normalized energy consumption.
Reliability: $R_{r e l i a b i l i t y} = P S R_{i} (t)$ , rewarding successful transmissions.
Latency: $R_{l a t e n c y}$ is defined as

$R_{l a t e n c y} = \{\begin{matrix} 1, & if D_{i} (t) \leq D_{t h} \\ 1 - \frac{D_{i} (t) - D_{t h}}{D_{m a x} - D_{t h}}, & if D_{t h} < D_{i} (t) < D_{m a x} \\ 0, & if D_{i} (t) \geq D_{m a x} \end{matrix}$

(14)

rewarding timely data delivery relative to thresholds $D_{t h}$ and $D_{m a x}$ .

Weights

w_{1} = 0.5, w_{2} = 0.4, w_{3} = 0.1

(summing to 1) prioritize energy saving and reliability, reflecting typical industrial monitoring needs. These were set based on preliminary experiments and domain knowledge; their sensitivity is a subject for future work.

3.4.5. Discount Factor ( $γ$ )

A discount factor

γ = 0.95

is used, valuing future rewards to encourage long-term energy-efficient behavior.

3.5. Optimization Objective

The goal is to find an optimal policy

π^{*} : S \to A

that maximizes the expected sum of discounted future rewards:

π^{*} = arg max_{π} E [\sum_{k = 0}^{\infty} γ^{k} R (s_{t + k}, a_{t + k}) | s_{t}, a_{t + k} = π (s_{t + k})]

(15)

This policy guides the IIoT node to make optimal power and sleep mode choices to balance energy, reliability, and latency.

3.6. Problem Complexity Analysis

Optimizing energy efficiency in this context is challenging due to the following:

High-Dimensional and Continuous States: The state space is vast, making tabular RL methods impractical.
Non-Stationary Environment: Plant dynamics change, requiring adaptive policies.
Delayed Rewards and Temporal Credit Assignment: The long-term effects of actions are hard to attribute.
Complex Objective Trade-offs: Balancing energy, reliability, and latency is non-trivial.
IIoT Device Resource Constraints: Algorithms must be lightweight and computationally efficient.

These complexities necessitate an intelligent, adaptive, and resource-efficient approach like the proposed GM-DDQN, capable of learning effective policies from direct interaction with the dynamic environment.

4. Lightweight Double-Deep Q-Network (GM-DDQN) Method

This section details our proposed gradient memory double-deep Q-network (GM-DDQN), an approach specifically engineered for energy efficiency optimization on resource-constrained IIoT devices. We begin with an overview of the GM-DDQN architecture, followed by a discussion of its core lightweighting components, state and action representations, training methodology, practical implementation considerations for IIoT deployment, and a brief theoretical analysis.

4.1. GM-DDQN Architecture Overview

The GM-DDQN framework adapts the conventional double-deep Q-network (DDQN) by incorporating several key modifications aimed at drastically reducing computational and memory demands without substantial performance degradation. The overall architecture, as illustrated in Figure 2, highlights the interaction between the input state, the lightweight Q-networks (main and target), and the novel gradient memory mechanism.

Distinct from standard DDQN, which typically employs two identically structured, often large, neural networks, GM-DDQN utilizes a significantly more compact network design. Furthermore, it replaces the conventional large experience replay buffer with a “gradient memory” mechanism, designed for efficient learning from a smaller processed representation of recent experiences.

4.2. The Core Innovation: The Gradient Memory Mechanism

The conceptual cornerstone of GM-DDQN is the gradient memory mechanism, a novel and memory-efficient alternative to the conventional experience replay buffer that forms the primary memory bottleneck in standard DRL algorithms. Instead of storing full

(s, a, r, s^{'})

transitions, which is infeasible on resource-constrained devices, our approach maintains a compact cyclical buffer of the most recent loss gradients.

The loss function

L (θ_{t})

at time t is the standard DDQN loss:

L (θ_{t}) = E_{(s, a, r, s^{'}) \sim D_{t}} [{(y_{t} - Q (s, a; θ_{t}))}^{2}]

(16)

where

y_{t} = r + γ {max}_{a^{'}} Q (s^{'}, a^{'}; θ_{t}^{-})

is the target value, with

θ_{t}^{-}

being the parameters of the target network. The gradient

g_{t} = \nabla_{θ} L (θ_{t})

is computed.

A gradient memory matrix

G \in R^{m \times | θ |}

stores the last m gradients (typically

m = 8

in our setup, where

| θ | = 572

is the number of parameters). G is updated cyclically:

G [t mod m] = g_{t}

.

The Q-network parameters

θ

are then updated using a weighted average of these stored gradients:

θ_{t + 1} = θ_{t} - α \cdot \sum_{j = 0}^{m - 1} w_{j} \cdot G [j]

(17)

where

α

is the learning rate. The weights

w_{j}

prioritize more recent gradients, calculated as

w_{j} = \frac{exp (- λ \cdot ((t mod m) - j mod m))}{\sum_{k = 0}^{m - 1} exp (- λ \cdot ((t mod m) - k mod m))}

(18)

with a decay parameter

λ = 0.5

. This mechanism drastically reduces memory requirements compared to experience replay, as shown in Table 1.

The total memory reduction is approximately 99%, rendering GM-DDQN highly suitable for memory-constrained IIoT devices.

4.3. Supporting Lightweighting Techniques

To fully realize the benefits of the gradient memory mechanism and create a deployable on-device framework, we complement our core innovation with two established lightweighting strategies.

4.3.1. Compact Neural Network Architecture and Quantization

The neural network in GM-DDQN is a multi-layer perceptron (MLP) with a reduced number of layers and neurons compared to typical DRL models. For a given state

s_{t}

, the Q-value

Q (s_{t}, a)

for each action a is computed as

\begin{matrix} h_{1} & = σ (W_{1} s_{t} + b_{1}) \\ h_{2} & = σ (W_{2} h_{1} + b_{2}) \\ Q (s_{t}, a) & = W_{3} h_{2} + b_{3} \end{matrix}

(19)

where

s_{t} \in R^{n_{s}}

is the state vector (with

n_{s} = 5

as per Section 3.4.1),

σ

is the ReLU activation function,

h_{1}, h_{2} \in R^{n_{h}}

are hidden layers (with

n_{h} = 16

), and

W_{k}, b_{k}

are the corresponding weight matrices and bias vectors. The output layer has

| A |

neurons, where

| A | = 12

(4 power levels × 3 sleep modes) is the dimension of our discrete action space.

This streamlined structure leads to a significant parameter reduction. Table 2 compares the parameter count of a representative standard DDQN (e.g., with

n_{h} = 64

) against our GM-DDQN.

To further minimize the computational and memory footprint, 8-bit quantization is applied to the network parameters (

W_{k}, b_{k}

). The quantized weight

W_{q}

is obtained from W using

W_{q} = round (\frac{W - min (W)}{max (W) - min (W)} \times (2^{8} - 1))

(20)

Inference then utilizes fixed-point arithmetic: for instance, calculating a quantized hidden layer activation

h_{q}

:

h_{q} = clip (round (\frac{W_{q} \cdot s_{q} + b_{q}}{S}), 0, 255)

(21)

where

s_{q}, b_{q}

are quantized inputs and biases, and S is a scaling factor determined during quantization. This 89.3% parameter reduction, coupled with quantization, significantly lowers memory needs and accelerates inference, crucial for IIoT deployment.

4.3.2. Efficient Target Network Updates

Standard DDQN involves periodically copying all parameters from the main Q-network to the target network

θ^{-}

. GM-DDQN employs a more computationally frugal soft update mechanism:

θ_{t}^{-} = β \cdot θ_{t} + (1 - β) \cdot θ_{t - 1}^{-}

(22)

where

β

is a small update coefficient (e.g.,

β = 0.05

), gradually blending the main network’s parameters into the target network.

4.4. State and Action Representation

Effective learning requires appropriate representation of states and actions.

4.4.1. State Normalization

Input state components (SINR

γ_{i} (t)

, battery

B_{i} (t)

, vibration

V_{i} (t)

, queue

Q_{i} (t)

, and temperature

T_{i} (t)

) are normalized to the range

[0, 1]

to enhance learning stability and efficiency. For example, for SINR,

γ_{i}^{n o r m} (t) = \frac{γ_{i} (t) - γ_{m i n}}{γ_{m a x} - γ_{m i n}}

(23)

Similar normalization is applied to other components using their respective predefined min/max values or thresholds (e.g.,

B_{i}^{m a x}

for battery and

Q_{t h}

for queue length).

4.4.2. Action Discretization

As established in the problem formulation (Section 3.4.2), the agent’s actions are composite choices of transmission power and sleep mode. To make this tractable for our DRL algorithm, these choices are handled within a single flat action space. We define a set of four power levels (e.g.,

{0, 5, 10, 15}

dBm) and three sleep modes (e.g., active, light sleep, and deep sleep), resulting in a cardinality of

| A | = 4 \times 3 = 12

discrete actions. For instance, action index 0 could map to the pair {0 dBm, deep sleep}, while action index 11 could map to {15 dBm, active}. At each time step, the agent’s neural network outputs Q-values for all 12 actions, and the agent selects the single action with the highest value. This approach ensures the action space remains computationally tractable while providing sufficient control granularity.

4.5. Training Procedure

The GM-DDQN is trained offline using the procedure outlined in Algorithm 1. Key parameters include a learning rate

α = 0.001

, discount factor

γ = 0.95

, and an exploration probability

ϵ

that anneals from 1.0 to 0.05 during training to balance exploration and exploitation.

4.6. Implementation Considerations for IIoT Devices

Deploying GM-DDQN on IIoT devices necessitates careful optimization of memory and computation:

Memory Footprint Optimization: Quantized weights are stored in flash memory (often more plentiful than RAM). In-place computations for activations and direct matrix operations on quantized values further reduce RAM usage by avoiding intermediate dequantized copies.
Computational Efficiency: Activation functions are implemented via lookup tables to bypass costly floating-point operations on MCUs lacking hardware FPUs. “Batch-free” inference (one sample at a time) minimizes peak RAM. Early stopping heuristics during inference can prune unnecessary computations if an action can be determined with partial evaluation.
Energy-Aware Execution: The inference frequency itself is adapted based on battery level and environmental stability. Inference is coordinated with data transmission to utilize active processor states. Between decisions, the device enters optimized low-power modes, selected based on the predicted time to the next decision.

These strategies ensure that the GM-DDQN agent not only learns energy-efficient policies but also executes efficiently.

Algorithm 1 GM-DDQN Training Procedure

1:: Input: Learning rate $α$ , discount factor $γ$ , target update rate $β$ , gradient memory size m
2:: Output: Trained and quantized Q-network parameters $θ$
3:: Initialize main Q-network $θ$ and target Q-network $θ^{-} = θ$ randomly.
4:: Initialize gradient memory $G = 0^{m \times | θ |}$ .
5:: Initialize time step $t = 0$ .
6:: while not converged do
7:: Observe current state $s_{t}$ .
8:: Select action $a_{t}$ : with probability $ϵ$ choose random action, else $a_{t} = arg {max}_{a} Q (s_{t}, a; θ)$ .
9:: Execute $a_{t}$ : observe reward $r_{t}$ and next state $s_{t + 1}$ .
10:: Compute target Q-value $y_{t} = r_{t} + γ {max}_{a^{'}} Q (s_{t + 1}, a^{'}; θ^{-})$ .
11:: Compute loss $L_{t} = {(y_{t} - Q (s_{t}, a_{t}; θ))}^{2}$ .
12:: Compute gradient $g_{t} = \nabla_{θ} L_{t}$ .
13:: Store gradient: $G [t mod m] \leftarrow g_{t}$ .
14:: Compute weighted average gradient ${\bar{g}}_{t} = \sum_{j = 0}^{m - 1} w_{j} \cdot G [j]$ using Equation (18).
15:: Update main network: $θ \leftarrow θ - α \cdot {\bar{g}}_{t}$ .
16:: Update target network: $θ^{-} \leftarrow β \cdot θ + (1 - β) \cdot θ^{-}$ .
17:: $t \leftarrow t + 1$ .
18:: Anneal $ϵ$ .
19:: end while
20:: Quantize final parameters $θ$ for deployment.
21:: Return $θ$ .

4.7. Theoretical Analysis

Despite its modifications, GM-DDQN retains desirable theoretical properties while achieving its lightweight goals.

Convergence and Stability through Smoothed Gradient Estimation: The gradient memory mechanism, which uses a weighted sum of recent gradients for updates, can be viewed as a form of smoothed gradient estimation. By averaging over the last m gradients, it reduces the variance of the update signal, which can lead to more stable convergence, analogous to the role of momentum in optimization algorithms. This smoothing effect is particularly beneficial in noisy and non-stationary environments as it prevents the learning process from overreacting to transient fluctuations.

Approximating Recent Experience Sampling: Furthermore, the gradient memory serves as a computationally efficient proxy for experience replay. Instead of storing memory-intensive state-transition tuples

(s, a, r, s^{'})

, it stores a compact history of their resulting learning signals (gradients). The update rule in Equation (17) effectively approximates the expected gradient over a mini-batch of experiences drawn from an implicit non-uniform distribution that heavily favors recent policy-relevant interactions. This allows the agent to learn from a diverse yet temporally focused set of experiences, retaining the stabilization benefits of experience replay while being orders of magnitude more memory-efficient.

Computational Complexity: Inference complexity is dominated by matrix multiplications in the small network, roughly

O (n_{s} n_{h} + n_{h}^{2} + n_{h} | A |)

, significantly lower than larger networks. Training update complexity per step is primarily driven by gradient computation and the weighted sum over the gradient memory,

O (m | θ |)

, which is efficient due to small m and

| θ |

.

Approximation Error: Simplification and quantization introduce approximation errors. If

f^{*}

is the true optimal Q-function,

f_{DDQN}

the approximation by a standard DDQN, and

f_{GM - DDQN}

by our method, then

∥ f^{*} - f_{GM - DDQN} ∥_{\infty} \leq ∥ f^{*} - f_{DDQN} ∥_{\infty} + {∥ f_{DDQN} - f_{GM - DDQN} ∥}_{\infty}

. The second term, due to architectural simplification and quantization, can be bounded, with the architectural part related to

O (1 / \sqrt{n_{h}})

. Careful design aims to keep this additional error small.

In essence, GM-DDQN is tailored for resource-constrained settings by strategically replacing the large memory-intensive experience replay buffer with the compact computationally efficient gradient memory. This novel mechanism aims to preserve the core learning capabilities and stability of the DDQN framework through smoothed history-aware gradient updates.

5. Experiments

This section describes the experimental framework developed to evaluate the proposed GM-DDQN method, including the simulation environment, test scenarios, implementation specifics, comparative baselines, evaluation metrics, and a detailed analysis of the results. We conduct comprehensive experiments to assess the energy efficiency, communication reliability, computational resource requirements, and overall performance of our approach in various simulated thermal power plant scenarios.

5.1. Experimental Setup

5.1.1. Simulation Environment

To evaluate the performance of our proposed method in environments that closely mimic real thermal power plants, we developed a comprehensive simulation platform. A critical aspect of our experimental design is to distinguish between the learning agent and the overall network environment. Our simulation focuses on the decision-making process of a single IIoT node, which acts as the learning agent. This methodological choice is deliberate, allowing us to isolate and rigorously evaluate the learning capability of the GM-DDQN algorithm itself before addressing the more complex multi-agent learning problem.

To ensure the simulation remains representative of a dense network deployment, the presence of other concurrently operating IIoT devices is modeled as a component of the ambient interference. Specifically, the interference term

I_{i} (t)

in our model (Section 3.2.1) accounts not only for EMI from industrial machinery (

I_{v a r} (t)

and

I_{b u r s t} (t)

) but also for the aggregate stochastic interference generated by a population of peer devices. This approach allows us to test our agent’s robustness in a realistic interference-rich setting that reflects a multi-device reality without introducing the non-stationarity of a full multi-agent learning environment.

To assess the policy’s performance under varying channel conditions, we evaluated the agent at three representative distances from the gateway: a short distance (e.g., 20 m), a medium distance (e.g., 50 m), and a long distance (e.g., 80 m). The final results presented in this paper are averaged across these distance scenarios to ensure statistical robustness.

This simulation environment includes the following:

Thermal Power Plant Environment Model: Incorporating electromagnetic interference patterns (baseline, variable, and burst), equipment vibration dynamics (normal, fluctuating, and anomalous), temperature variations, and their impact on path loss characteristics.
Wireless Network Simulator: Modeling packet transmission success/failure based on SINR, channel dynamics (path loss influenced by temperature, as per Section 3.2.3), and interference effects.
IIoT Device Energy Consumption Model: Accounting for energy consumed during transmission, reception, data processing, and different sleep modes, as detailed in Section 3.3.
Hardware Resource Simulator: Tracking the computational (e.g., CPU cycles or equivalent time) and memory (ROM/RAM) footprint during algorithm execution on a target IIoT node.

The simulation parameters are listed in Table 3. These values are derived from a combination of actual measurements in thermal power plants, specifications of commercial IIoT devices (e.g., for battery capacity, processor frequency, and radio characteristics), and established models from wireless communication literature.

5.1.2. Test Scenarios

We evaluate the algorithm performance across three distinct scenarios, representative of typical operational challenges in thermal power plants:

Scenario 1: High-Interference Environment. This scenario simulates operations near high-power electrical equipment, characterized by strong electromagnetic interference. The SINR fluctuates significantly, typically between 5 and 15 dB, due to frequent and intense interference bursts, demanding robust adaptive transmission strategies from the algorithm.
Scenario 2: Variable Vibration Monitoring. This scenario focuses on applications like turbine health monitoring, where equipment vibration patterns frequently switch between normal operational levels and abnormal states (indicative of potential faults). This requires the algorithm to dynamically adjust its data sampling (implicitly linked to data generation for transmission) and transmission policies to capture critical events while conserving energy during quiescent periods.
Scenario 3: Temperature-Constrained Deployment. This scenario models the dual impact of high ambient temperatures on IIoT nodes: (a) temperature-dependent variations in wireless channel path loss (as described in Section 3.2.3), and (b) accelerated battery discharge rates at elevated temperatures.

For each scenario, simulations are run for an equivalent of 30 days of device operation, with key metrics recorded at 1 min intervals. This duration and granularity allow for the assessment of both short-term adaptive responses and long-term energy efficiency trends.

5.1.3. Implementation Details

The proposed GM-DDQN algorithm and baseline RL methods were implemented for simulation. For evaluating on-device performance metrics (memory and computation time), the GM-DDQN was profiled as if deployed on a target hardware platform characterized by

Processor: ARM Cortex-M4F operating at 48 MHz.
Memory: 32 KB RAM and 256 KB flash.
Radio: IEEE 802.15.4-compliant [59].

Deployment optimizations for GM-DDQN included 8-bit quantization for all network parameters and the use of fixed-point arithmetic during inference. The memory optimization techniques (as discussed in Section 4.6) and computational acceleration strategies (also from Section 4.6) were factored into these profiles.

5.2. Baseline Methods for Comparison

We compare GM-DDQN against a diverse set of baseline methods:

Fixed Policy (FP): A non-adaptive strategy using a fixed transmission power (10 dBm) and a predetermined periodic sleep schedule. This represents a basic IIoT deployment.
Threshold-Based Adaptive (TBA): Adjusts transmission power based on observed PSR and sleep intervals based on vibration thresholds. This reflects common adaptive practices in current industrial IoT.
Standard DDQN: A full-scale DDQN with 64 neurons per hidden layer and an experience replay buffer of 10,000 transitions, serving as a performance benchmark without strict resource constraints.
Deep Deterministic Policy Gradient (DDPG): An advanced actor–critic RL method for continuous control problems.
Proximal Policy Optimization (PPO): A stable policy gradient RL method widely used in various tasks.
Q-Learning with Function Approximation (Q-FA): A lightweight RL method using linear function approximation instead of deep networks, offering low computational cost but limited representational power.

To ensure a fair comparison, all reinforcement learning methods (standard DDQN, DDPG, PPO, Q-FA, and GM-DDQN) utilize the same state space, action space, and reward function definitions, as detailed in Section 3.4. We note that a comparison with an exhaustive search algorithm, which would yield the theoretical optimal policy, was not included. An exhaustive search would require evaluating every possible sequence of actions over the simulation horizon. Given an action space of size

| A | = 12

and a long time horizon T, the computational complexity is on the order of

O (| A |^{T})

, rendering it computationally intractable for any practical scenario. This intractability is precisely the motivation for employing intelligent approximation methods like DRL.

5.3. Evaluation Metrics

We employ a comprehensive suite of metrics, categorized as follows, to evaluate the performance of each method:

Energy Efficiency:
- Energy consumption per successfully transmitted packet (mJ/packet).
- Average power consumption (mW).
- Estimated battery lifetime (months).
- Distribution of energy consumption across components (e.g., transmission, processing, and sleep), which will be detailed with the results where applicable.
Communication Performance:
- PSR (%).
- End-to-end latency (ms).
- Throughput (kbps).
- Adaptability to interference bursts (e.g., time for PSR to stabilize after a burst).
Resource Utilization (for on-device deployment profile):
- Memory footprint (ROM and RAM in KB).
- Computation time per decision-making step (ms).
- Energy overhead of algorithm execution per decision (mJ/decision).
- Initialization time (s).
Learning Performance (for RL methods):
- Convergence speed (number of training episodes required).
- Cumulative reward achieved during training and evaluation.
- Policy stability (e.g., consistency of actions in similar states, assessed qualitatively and by observing reward variance).
- Adaptation time to significant environmental changes (recovery duration).

This multi-dimensional evaluation aims to provide a holistic view, capturing not only direct energy savings but also potential trade-offs in communication, resource usage, and learning dynamics.

5.4. Experimental Results and Analysis

The comprehensive experimental results are summarized in Figure 3, which evaluates the proposed GM-DDQN against baseline methods in terms of energy efficiency, communication performance, and learning capability.

5.4.1. Energy Efficiency Performance

The primary goal of GM-DDQN is to enhance energy efficiency. Figure 3a visually presents the estimated battery lifetime for all the evaluated methods across the three test scenarios, highlighting GM-DDQN’s effectiveness. The precise values are also listed in Table 4. GM-DDQN extends battery lifetime by up to 73% compared to the fixed policy (FP) and by up to 38% compared to Threshold-Based Adaptation (TBA). For an IIoT deployment with hundreds of sensors, this translates to extending the battery replacement cycle by several months, significantly lowering operational and maintenance costs.

In terms of energy consumption per successfully transmitted packet, GM-DDQN achieved an average reduction of 42% compared to FP and 27% compared to TBA across all the scenarios. When compared with other RL algorithms, GM-DDQN’s energy consumption per packet was only marginally (5–8%) higher than that of standard DDQN, DDPG, and PPO. This is a significant finding as it indicates that GM-DDQN’s lightweight design incurs minimal loss in optimization capability while drastically reducing resource requirements.

5.4.2. Communication Performance

Figure 3b illustrates the communication performance. The left panel shows the PSR for each method across the different scenarios, while the right panel compares the average PSR and average end-to-end latency. GM-DDQN consistently maintains a high PSR of 96–97% across all the scenarios. This performance is comparable to standard DDQN, DDPG, and PPO (97–98%), significantly higher than the fixed policy (88–92%), and slightly better than Threshold-Based Adaptation (94–95%). Q-FA achieves an intermediate PSR of 95–96% but exhibits greater variability.

Regarding end-to-end latency (Figure 3b, right panel), GM-DDQN achieves an average of 320 ms. This is well within the acceptable limits for most industrial monitoring applications, considerably better than the fixed policy (average 480 ms), and comparable to the other advanced RL methods, which range between 290 and 310 ms. The Q-FA method resulted in an average latency of 350 ms.

5.4.3. Resource Utilization

Table 5 compares the on-device resource utilization metrics. GM-DDQN demonstrates a significant reduction in memory footprint (ROM and RAM) by over 99% compared to standard DDQN, DDPG, and PPO, making it viable for typical IIoT devices (e.g., 32 KB RAM). The computation time per decision is reduced from 85–120 ms for the larger models to just 12 ms for GM-DDQN (an 85–90% reduction). Consequently, the energy overhead per decision for GM-DDQN (0.6 mJ) is also substantially lower (85–90% reduction) than that of the standard deep RL methods (4.2–6.0 mJ), rendering its own operational energy cost negligible relative to the savings it provides.

5.4.4. Learning Performance

The learning performance of the RL-based methods is evaluated from two complementary perspectives, as shown in Figure 3c,d.

First, Figure 3c illustrates the raw dynamics and overall convergence behavior of the training process. The smoothed trend line shows that GM-DDQN converges in approximately 3500 episodes. While this is slower than standard DDQN (around 2200 episodes), it is still well within the practical limits for pre-deployment training. Importantly, the final cumulative reward achieved by GM-DDQN is approximately 94% of that achieved by standard DDQN, indicating that its lightweight design preserves the vast majority of the learning capability.

Second, to specifically address the stability of the learning process, a dedicated statistical analysis is presented in Figure 3d. This plot provides a clearer noise-free comparison of the algorithms’ robustness. The shaded area around each curve, representing the standard deviation across 10 independent runs, serves as a direct visual indicator of performance consistency. The narrow shaded region associated with GM-DDQN’s learning curve signifies low variance, indicating that the algorithm’s performance is highly stable and reliably repeatable across different training initializations. This degree of stability is comparable to that of the more complex standard DDQN and PPO methods, and notably superior to the wider variance observed for the simpler Q-FA method. This result validates the robustness of our proposed lightweight approach.

In tests involving sudden changes to interference patterns or vibration levels, GM-DDQN demonstrated a recovery time (adaptation to the new conditions) of 15–20 min. This compares to 10–15 min for standard DDQN and 5–10 min for DDPG/PPO. While slightly slower in adaptation, GM-DDQN’s response time is still adequate for industrial scenarios, where significant environmental shifts typically occur over longer timescales.

5.4.5. Performance Analysis in Specific Scenarios

GM-DDQN exhibited robust and adaptive performance across the diverse test scenarios:

High-Interference Environment (Scenario 1): GM-DDQN effectively modulated its transmission power in response to dynamic interference levels. It increased power during interference peaks to maintain high PSR and reduced power during lulls to conserve energy. This outperformed the fixed policy, which wasted energy, and the Threshold-Based Adaptive strategy, which responded more slowly to sudden interference bursts.
Variable Vibration Monitoring (Scenario 2): GM-DDQN adeptly managed sleep modes according to vibration intensity. It remained active during high vibration, transitioned to light sleep for moderate vibration, and entered deep sleep during normal (low) vibration periods. This strategy ensured the capture of critical data while maximizing energy savings, proving superior to the sub-optimal switching of the TBA strategy and the non-adaptive fixed policy.
Temperature-Constrained Deployment (Scenario 3): GM-DDQN demonstrated the best resilience to temperature variations. Its energy consumption increased by only 32% as temperatures rose from 40 °C to 120 °C. This minimal increase, compared to the other algorithms, highlights its superior ability to learn and adapt to the complex interplay between environmental factors (like temperature-induced path loss changes) and optimal operational strategies.

5.4.6. Ablation Study and Component Analysis

To assess the individual contributions of GM-DDQN’s key components, an ablation study was performed. Table 6 summarizes the average performance across scenarios when specific components are removed or altered.

The ablation results, presented in Table 6, provide critical insights into the specific contribution of each component within the GM-DDQN framework. The following analysis delves into the underlying reasons for the observed performance changes.

Without Gradient Memory: Removing the gradient memory and reverting to a simple single-sample update leads to a noticeable increase in energy consumption. This result validates the efficacy of our core innovation. As theorized in Section 4.7, the gradient memory provides a smoothed history-aware gradient estimate. This approach mitigates the high variance inherent in updates based on single, potentially noisy, state transitions, leading to a more stable learning process and a more consistently energy-efficient policy.
Without Quantization: This variant reveals a crucial performance–resource trade-off. While forgoing quantization yields a marginal improvement in energy efficiency (a 3.7% reduction in mJ/packet), it causes a prohibitive 217% increase in computation time (from 12 ms to 38 ms). The performance impact of quantization is minimal because the learned Q-function is robust; the optimal actions in this problem space are determined by significant differences in Q-values, not by fine-grained distinctions that would be lost to 8-bit precision. This finding is consistent with the established literature, demonstrating that robust neural networks are highly amenable to quantization. Therefore, the immense computational gain, which is essential for on-device deployment, far outweighs the negligible loss in policy optimality.
Single Hidden Layer: Using only a single hidden layer results in a significant 14% degradation in energy efficiency. This is attributed to insufficient representational capacity. A single-layer network struggles to model the complex non-linear relationships between the diverse state variables (e.g., SINR, vibration, and temperature) and the optimal action. The second hidden layer is crucial for creating hierarchical feature representations, allowing the model to capture the intricate trade-offs demanded by our multi-objective reward function, a task for which a flatter architecture is ill-equipped.
Without Target Network: The removal of the target network causes the most substantial performance degradation, with a 25% decrease in energy efficiency. This instability arises because, without a stable separate target network, the learning target, $y_{t}$ , becomes highly non-stationary. The same parameters ( $θ_{t}$ ) are used to both estimate the current Q-value and calculate the target value, creating a destructive feedback loop that leads to severe learning instability and oscillations. The use of a slowly updated target network is therefore indispensable for decoupling the updates and ensuring stable convergence to a high-quality policy, a finding consistent with the foundational principles of deep Q-learning.

In summary, this in-depth analysis confirms that each component of GM-DDQN is a deliberate design choice. The framework’s success stems from the synergistic interplay between the core innovation of gradient memory and carefully selected lightweighting techniques, which collectively achieve a robust balance between high performance and resource efficiency.

6. Discussion

6.1. Summary of Experimental Results and Their Significance

The experimental findings validate the exceptional performance of the GM-DDQN framework in optimizing energy efficiency for industrial IoT devices within thermal power plants, underscoring its substantial potential for real-world deployment. Across three distinct testing scenarios—high-interference environments, vibration-monitoring fluctuations, and high-temperature constrained deployments—GM-DDQN demonstrates notable energy efficiency gains. Specifically, it reduces per-packet transmission energy consumption by 42% compared to the fixed policy and 27% compared to the threshold-based adaptive strategy. When benchmarked against advanced reinforcement learning methods like standard DDQN, DDPG, and PPO, GM-DDQN achieves comparable energy efficiency (differing by only 5–8%) while drastically lowering computational resource demands. This reflects the effectiveness of its lightweight design in maintaining optimization capabilities with minimal compromise.

Battery life evaluations further reveal that GM-DDQN extends device lifespan by up to 73% over the fixed policy and 38% over the threshold-based adaptive approach. For power plants with hundreds of sensors, this translates to significant cost savings by extending maintenance cycles by months. Moreover, GM-DDQN sustains a high PSR of 96–97%, closely aligning with the 97–98% achieved by other reinforcement learning methods, and far surpassing the fixed policy’s 88–92% and the threshold-based adaptive’s 94–95%. This reliability is achieved while maintaining an average end-to-end latency of 320 milliseconds, which meets typical industrial monitoring requirements.

In terms of learning performance, GM-DDQN converges after approximately 3500 episodes. While this is slower than standard DDQN’s 2200 episodes, its final cumulative reward reaches 94% of that achieved by standard DDQN. This affirms that the lightweight design has a negligible impact on learning capability, positioning GM-DDQN as a practical solution for resource-constrained environments.

6.2. Research Contributions

This study’s contributions, validated through rigorous experimentation, offer significant theoretical and practical advancements.

First, the core innovation of the gradient memory mechanism was empirically proven to be a viable memory-efficient alternative to experience replay for on-device learning. As quantified in Table 5, this mechanism, combined with a compact network architecture and quantization, reduces the total on-device memory footprint by approximately 99% compared to standard DDQN, making deployment on resource-constrained IIoT devices (32 KB RAM) feasible. The ablation study (Table 6) further confirmed its importance, showing that its removal led to noticeable degradation in energy efficiency, thus validating its role in stabilizing the learning process.

Second, the holistic lightweight design of the GM-DDQN framework demonstrated that substantial resource reduction does not necessitate a major compromise in performance. Figure 3d shows that GM-DDQN achieves a final cumulative reward that is 94% that of the full standard DDQN, with comparable learning stability. This addresses a critical trade-off in edge intelligence: achieving high performance with minimal overhead.

Third, the practical value of GM-DDQN was substantiated through scenario-based evaluations, which showed significant improvements in operational metrics. The joint optimization of the power and sleep modes extended battery lifetime by up to 73% compared to fixed policies (Figure 3a and Table 4), translating to a tangible reduction in maintenance costs from an 8–9 month cycle to a 14–15 month cycle. This was achieved while maintaining a high PSR of 96–97% (Figure 3b), demonstrating the framework’s ability to balance energy savings with communication reliability, a key requirement for industrial applications. The adaptive strategies learned by the agent proved effective across diverse challenges, from high-interference environments (Scenario 1) to variable-load conditions (Scenario 2), as detailed in Section 5.4.5.

Collectively, these validated contributions provide a robust and practical solution for deploying edge intelligence in demanding industrial environments, offering a clear pathway to more efficient, reliable, and intelligent monitoring systems.

6.3. Limitations and Future Work

Despite the promising outcomes, this study has several limitations that, when properly contextualized, define a clear and logical roadmap for future research.

First, the validation of our framework relies on simulation. A potential simulation-to-reality gap exists, a crucial point concerning real-world applicability. While our simulation platform was designed to be highly realistic (as detailed in Section 5.1.1), real-world deployments may present unmodeled challenges, such as complex multipath fading or specific RF interference signatures. To bridge this gap, a crucial next step involves validating the simulation model against data collected from a real-world testbed. Our ongoing efforts are focused on deploying a small number of sensor nodes within an operational power plant to gather baseline data, which will be used to calibrate and refine our simulation parameters, thereby enhancing its fidelity.

Second, and intrinsically linked to our research methodology, is the use of a single-agent learning framework. This was a deliberate methodological choice designed to isolate and rigorously evaluate the primary contribution of this paper: the novel GM-DDQN algorithm and its core gradient memory mechanism. By simplifying the environment to a single agent, we could analyze the algorithm’s performance without the confounding variables of multi-agent dynamics, such as non-stationarity and credit assignment. As justified in Section 5.1.1, the influence of other nodes was modeled as realistic environmental interference, ensuring the robustness of our evaluation. Having now established the efficacy of the core algorithm in this controlled setting, the clear and vital future direction is to extend this framework to a multi-agent reinforcement learning (MARL) setting to optimize the network-wide performance.

Third, following these steps, the subsequent phase would be empirical testing on physical IIoT hardware. Migrating the validated agent to an embedded system like an ARM Cortex-M4 introduces practical challenges, including RTOS task scheduling, precise energy accounting for memory access, and handling sensor noise. Overcoming these hardware-specific issues is the final step to fully substantiate the framework’s on-device viability.

Finally, other areas for future work include developing mechanisms for online or continual learning to adapt to entirely new scenarios, investigating the framework’s security vulnerabilities under adversarial conditions, and incorporating automated hyperparameter tuning methods to further enhance the practical deployment of GM-DDQN.

7. Conclusions

The proposed GM-DDQN framework effectively optimizes energy efficiency in industrial IoT devices, particularly for applications within thermal power plants. Experimental validation demonstrates its capability to achieve significant energy savings, leading to extended battery life, while maintaining reliable communication performance with minimal computational demands. These attributes make GM-DDQN a promising solution for resource-constrained IIoT deployments. Future work should focus on enhancing its convergence speed, further improving its adaptability to real-world dynamic conditions through mechanisms for online learning or transfer learning, and rigorously testing its performance and robustness in operational environments. Addressing these aspects will pave the way for broader practical application of lightweight deep reinforcement learning in creating sustainable and intelligent industrial systems.

Author Contributions

Conceptualization, S.G. and Y.Z.; methodology, S.G.; investigation, Y.Z. and S.G.; writing—original draft preparation, S.G.; writing—review and editing, S.G. and L.F.; supervision, L.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Key Research and Development Program of China (2023YFB2703800), in part by the Science and Technology Development Fund, Macau SAR (0008/2025/RIB1 and 0010/2024/AGJ)), in part by the National Natural Science Foundation of China (61872452), in part by the Guangdong HUST Industrial Technology Research Institute, Guangdong Provincial Key Laboratory of Manufacturing Equipment Digitization (2023B1212060012).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sarjan, H.; Ameli, A.; Ghafouri, M. Cyber-security of industrial internet of things in electric power systems. IEEE Access 2022, 10, 92390–92409. [Google Scholar] [CrossRef]
Khan, W.Z.; Rehman, M.; Zangoti, H.M.; Afzal, M.K.; Armi, N.; Salah, K. Industrial internet of things: Recent advances, enabling technologies and open challenges. Comput. Electr. Eng. 2020, 81, 106522. [Google Scholar] [CrossRef]
Erhueh, O.V.; Elete, T.; Akano, O.A.; Nwakile, C.; Hanson, E. Application of Internet of Things (IoT) in energy infrastructure: Lessons for the future of operations and maintenance. Compr. Res. Rev. Sci. Technol. 2024, 2, 28–54. [Google Scholar] [CrossRef]
Qiu, F.; Kumar, A.; Hu, J.; Sharma, P.; Tang, Y.B.; Xiang, Y.X.; Hong, J. A Review on Integrating IoT, IIoT, and Industry 4.0: A Pathway to Smart Manufacturing and Digital Transformation. IET Inf. Secur. 2025, 2025, 9275962. [Google Scholar] [CrossRef]
Majhi, A.A.K.; Mohanty, S. A Comprehensive Review on Internet of Things Applications in Power Systems. IEEE Internet Things J. 2024, 11, 34896–34923. [Google Scholar] [CrossRef]
Hu, Y.; Jia, Q.; Yao, Y.; Lee, Y.; Lee, M.; Wang, C.; Zhou, X.; Xie, R.; Yu, F.R. Industrial internet of things intelligence empowering smart manufacturing: A literature review. IEEE Internet Things J. 2024, 11, 19143–19167. [Google Scholar] [CrossRef]
Tabaa, M.; Monteiro, F.; Bensag, H.; Dandache, A. Green Industrial Internet of Things from a smart industry perspectives. Energy Rep. 2020, 6, 430–446. [Google Scholar] [CrossRef]
Abdullahi, I.; Longo, S.; Samie, M. Towards a distributed digital twin framework for predictive maintenance in industrial internet of things (IIoT). Sensors 2024, 24, 2663. [Google Scholar] [CrossRef]
Mao, W.; Zhao, Z.; Chang, Z.; Min, G.; Gao, W. Energy-efficient industrial internet of things: Overview and open issues. IEEE Trans. Ind. Inform. 2021, 17, 7225–7237. [Google Scholar] [CrossRef]
Zhang, C.; Li, W.; Zhang, H.; Zhan, T. Recent Advances in Intelligent Data Analysis and Its Applications. Electronics 2024, 13, 226. [Google Scholar] [CrossRef]
Liu, X.; Xu, F.; Ning, L.; Lv, Y.; Zhao, C. A Novel Sensor Deployment Strategy Based on Probabilistic Perception for Industrial Wireless Sensor Network. Electronics 2024, 13, 4952. [Google Scholar] [CrossRef]
D’Agostino, P.; Violante, M.; Macario, G. A Scalable Fog Computing Solution for Industrial Predictive Maintenance and Customization. Electronics 2024, 14, 24. [Google Scholar] [CrossRef]
Dong, Z.; Cao, Y.; Xiong, N.; Dong, P. EE-MPTCP: An Energy-Efficient Multipath TCP Scheduler for IoT-based power grid monitoring systems. Electronics 2022, 11, 3104. [Google Scholar] [CrossRef]
Majid, M.; Habib, S.; Javed, A.R.; Rizwan, M.; Srivastava, G.; Gadekallu, T.R.; Lin, J.C.W. Applications of wireless sensor networks and internet of things frameworks in the industry revolution 4.0: A systematic literature review. Sensors 2022, 22, 2087. [Google Scholar] [CrossRef]
Foukalas, F.; Pop, P.; Theoleyre, F.; Boano, C.A.; Buratti, C. Dependable wireless industrial IoT networks: Recent advances and open challenges. In Proceedings of the 2019 IEEE European Test Symposium (ETS), Baden-Baden, Germany, 27–31 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–10. [Google Scholar]
Hudda, S.; Haribabu, K. A review on WSN based resource constrained smart IoT systems. Discov. Internet Things 2025, 5, 56. [Google Scholar] [CrossRef]
Mohapatra, A.G.; Mohanty, A.; Pradhan, N.R.; Mohanty, S.N.; Gupta, D.; Alharbi, M.; Alkhayyat, A.; Khanna, A. An Industry 4.0 implementation of a condition monitoring system and IoT-enabled predictive maintenance scheme for diesel generators. Alex. Eng. J. 2023, 76, 525–541. [Google Scholar] [CrossRef]
Aragonés, R.; Oliver, J.; Malet, R.; Oliver-Parera, M.; Ferrer, C. Model and Implementation of a Novel Heat-Powered Battery-Less IIoT Architecture for Predictive Industrial Maintenance. Information 2024, 15, 330. [Google Scholar] [CrossRef]
Chaudhari, S.S.; Bhole, K.S.; Rane, S.B. Industrial Automation and Data Processing Techniques in IoT-Based Digital Twin Design for Thermal Equipment: A case study. J. Inst. Eng. (India) Ser. C 2025, 106, 553–569. [Google Scholar] [CrossRef]
Aragonés, R.; Oliver, J.; Ferrer, C. Enhanced Heat-Powered Batteryless IIoT Architecture with NB-IoT for Predictive Maintenance in the Oil and Gas Industry. Sensors 2025, 25, 2590. [Google Scholar] [CrossRef]
Zhang, J.; Wang, Y.; Yang, Y.; Ma, Y.; Dai, Z. Fault diagnosis and intelligent maintenance of industry 4.0 power system based on internet of things technology and thermal energy optimization. Therm. Sci. Eng. Prog. 2024, 55, 102902. [Google Scholar] [CrossRef]
Prosper, J. Development of Wireless Temperature Sensing Systems for Rotating Equipment in Harsh Environments. 2023. [Google Scholar]
Dagnino, A. Data Analytics in the Era of the Industrial Internet of Things; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
Balali, F.; Nouri, J.; Nasiri, A.; Zhao, T. Data Intensive Industrial Asset Management; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
Yılmaz, M.Y.; Üstüner, B.; Gül, Ö.M.; Çırpan, H.A. Sustainable Communication in 5G/6G Wireless Sensor Networks: A Survey on Energy-Efficient Collaborative Routing. ITU J. Wirel. Commun. Cybersecur. 2025, 2, 11–26. [Google Scholar]
Dvir, E.; Shifrin, M.; Gurewitz, O. Cooperative Multi-Agent Reinforcement Learning for Data Gathering in Energy-Harvesting Wireless Sensor Networks. Mathematics 2024, 12, 2102. [Google Scholar] [CrossRef]
O’Reilly, C.; Gluhak, A.; Imran, M.A.; Rajasegarar, S. Anomaly detection in wireless sensor networks in a non-stationary environment. IEEE Commun. Surv. Tutor. 2014, 16, 1413–1432. [Google Scholar] [CrossRef]
Hicheri, R.; Abdelgawwad, A.; Pätzold, M. A non-stationary relay-based 3D MIMO channel model with time-variant path gains for human activity recognition in indoor environments. Ann. Telecommun. 2021, 76, 827–837. [Google Scholar] [CrossRef]
Careem, M.A.A.; Dutta, A. Real-time prediction of non-stationary wireless channels. IEEE Trans. Wirel. Commun. 2020, 19, 7836–7850. [Google Scholar] [CrossRef]
Singh, S.P.; Kumar, N.; Kumar, G.; Balusamy, B.; Bashir, A.K.; Al-Otaibi, Y.D. A hybrid multi-objective optimisation for 6G-enabled Internet of Things (IoT). IEEE Trans. Consum. Electron. 2024, 71, 1307–1318. [Google Scholar] [CrossRef]
Vijayalakshmi, K.; Maheshwari, A.; Saravanan, K.; Vidyasagar, S.; Kalyanasundaram, V.; Sattianadan, D.; Bereznychenko, V.; Narayanamoorthi, R. A novel network lifetime maximization technique in WSN using energy efficient algorithms. Sci. Rep. 2025, 15, 10644. [Google Scholar] [CrossRef]
Hamzei, M.; Khandagh, S.; Jafari Navimipour, N. A quality-of-service-aware service composition method in the internet of things using a multi-objective fuzzy-based hybrid algorithm. Sensors 2023, 23, 7233. [Google Scholar] [CrossRef]
Singh, S.P.; Kumar, N.; Kumar, G.; Balusamy, B.; Bashir, A.K.; Al Dabel, M.M. Enhancing Quality of Service in IoT-WSN through Edge-Enabled Multi-Objective Optimization. IEEE Trans. Consum. Electron. 2025. [Google Scholar] [CrossRef]
Hazra, A.; Tummala, V.M.R.; Mazumdar, N.; Sah, D.K.; Adhikari, M. Deep reinforcement learning in edge networks: Challenges and future directions. Phys. Commun. 2024, 66, 102460. [Google Scholar] [CrossRef]
Kornaros, G. Hardware-assisted machine learning in resource-constrained IoT environments for security: Review and future prospective. IEEE Access 2022, 10, 58603–58622. [Google Scholar] [CrossRef]
Chen, W.; Qiu, X.; Cai, T.; Dai, H.N.; Zheng, Z.; Zhang, Y. Deep reinforcement learning for Internet of Things: A comprehensive survey. IEEE Commun. Surv. Tutor. 2021, 23, 1659–1692. [Google Scholar] [CrossRef]
Sagar, A.S.; Islam, M.Z.; Haider, A.; Kim, H.S. Uncertainty-aware federated reinforcement learning for optimizing accuracy and energy in heterogeneous industrial IoT. Appl. Sci. 2024, 14, 8299. [Google Scholar] [CrossRef]
Yadav, R.K.; Malavika, V.; Rajendran, P.S. A Novel Approach to Optimize Energy Consumption in Industries Using IIoT and Machine Learning. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
Hou, L.; Tan, S.; Zhang, Z.; Bergmann, N.W. Thermal energy harvesting WSNs node for temperature monitoring in IIoT. IEEE Access 2018, 6, 35243–35249. [Google Scholar] [CrossRef]
Farné, S.; Bassi, E.; Benzi, F.; Compagnoni, F. IIoT based efficiency monitoring of a Gantry robot. In Proceedings of the 2016 IEEE 14th International Conference on Industrial Informatics (INDIN), Poitiers, France, 19–21 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 714–719. [Google Scholar]
Pradhan, A.; Das, S.; Piran, M.J. Blocklength optimization and power allocation for energy-efficient and secure URLLC in industrial IoT. IEEE Internet Things J. 2023, 11, 9420–9431. [Google Scholar] [CrossRef]
Solati, A.; Moghaddam, J.Z.; Ardebilipour, M. An Energy Efficiency Method in UAV-Assisted IIoT Network. In Proceedings of the 2023 7th International Conference on Internet of Things and Applications (IoT), Isfahan, Iran, 25–26 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
Zhou, Z.; Zhang, C.; Xu, C.; Xiong, F.; Zhang, Y.; Umer, T. Energy-efficient industrial internet of UAVs for power line inspection in smart grid. IEEE Trans. Ind. Inform. 2018, 14, 2705–2714. [Google Scholar] [CrossRef]
Jiang, D.; Wang, Y.; Lv, Z.; Wang, W.; Wang, H. An energy-efficient networking approach in cloud services for IIoT networks. IEEE J. Sel. Areas Commun. 2020, 38, 928–941. [Google Scholar] [CrossRef]
Mamaghani, A.H.; Najafi, B.; Casalegno, A.; Rinaldi, F. Optimization of an HT-PEM fuel cell based residential micro combined heat and power system: A multi-objective approach. J. Clean. Prod. 2018, 180, 126–138. [Google Scholar] [CrossRef]
Ma, Y.; Liu, J.; Zhu, L.; Li, Q.; Guo, Y.; Liu, H.; Yu, D. Multi-objective performance optimization and control for gas turbine Part-load operation Energy-saving and NOx emission reduction. Appl. Energy 2022, 320, 119296. [Google Scholar] [CrossRef]
Kumar, R. A critical review on energy, exergy, exergoeconomic and economic (4-E) analysis of thermal power plants. Eng. Sci. Technol. Int. J. 2017, 20, 283–292. [Google Scholar] [CrossRef]
Qu, M.; Pan, L.; Lu, L.; Wang, J.; Tang, Y.; Chen, X. Study on thermal cycle efficiency improvement of secondary-loop in nuclear power plants based on dual-region topology optimization. Int. Commun. Heat Mass Transf. 2024, 159, 108183. [Google Scholar] [CrossRef]
Liu, Z.; Zhang, H.; Jin, X.; Zheng, S.; Li, R.; Guan, H.; Shao, J. Thermal economy analysis and multi-objective optimization of a small CO₂ transcritical pumped thermal electricity storage system. Energy Convers. Manag. 2023, 293, 117451. [Google Scholar] [CrossRef]
Cacciali, L.; Battisti, L.; Benini, E. Maximizing Efficiency in Compressed Air Energy Storage: Insights from Thermal Energy Integration and Optimization. Energies 2024, 17, 1552. [Google Scholar] [CrossRef]
Zhang, W.; He, Y.; Zhang, T.; Ying, C.; Kang, J. Intelligent resource adaptation for diversified service requirements in industrial IoT. IEEE Trans. Cogn. Commun. Netw. 2024. [Google Scholar] [CrossRef]
Dridi, A.; Afifi, H.; Moungla, H.; Badosa, J. A novel deep reinforcement approach for IIoT microgrid energy management systems. IEEE Trans. Green Commun. Netw. 2021, 6, 148–159. [Google Scholar] [CrossRef]
Dolatabadi, A.; Abdeltawab, H.; Mohamed, Y.A.R.I. A novel model-free deep reinforcement learning framework for energy management of a PV integrated energy hub. IEEE Trans. Power Syst. 2022, 38, 4840–4852. [Google Scholar] [CrossRef]
Chen, J.; Mi, J.; Guo, C.; Fu, Q.; Tang, W.; Luo, W.; Zhu, Q. Research on Offloading and Resource Allocation for MEC with Energy Harvesting Based on Deep Reinforcement Learning. Electronics 2025, 14, 1911. [Google Scholar] [CrossRef]
Yi, M.; Lin, M.; Chen, W. Network Function Placement in Virtualized Radio Access Network with Reinforcement Learning Based on Graph Neural Network. Electronics 2025, 14, 1686. [Google Scholar] [CrossRef]
Cicek, D.; Simsek, M.; Kantarci, B. Machine Learning-Driven Truck–Drone Collaborative Delivery for Time-and Energy-Efficient Last-Mile Deliveries. Electronics 2025, 14, 2026. [Google Scholar] [CrossRef]
Rahman, S.; Akter, S.; Yoon, S. A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment. Electronics 2024, 14, 1. [Google Scholar] [CrossRef]
Singh, S.; Ganorkar, A.M.; Anujhna, B. Enhancing Remote Oversight of Plants with a Compact IIoT System. In Proceedings of the 2023 IEEE International Conference on ICT in Business Industry & Government (ICTBIG), Indore, India, 8–9 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7. [Google Scholar]
IEEE Std 802.15.4-2020; IEEE Standard for Low-Rate Wireless Networks. IEEE: New York, NY, USA, 2020.

Figure 1. System architecture for IIoT deployment in a thermal power plant. Sensor nodes (nodes 1, 2, and 3) gather data from the IIoT environment (e.g., power plant machinery). This data, forming the ‘state’

〈 γ_{i} (t), B_{i} (t), V_{i} (t), Q_{i} (t), T_{i} (t) 〉

, is transmitted (via the gateway antenna symbol) to the GM-DDQN agent for decision optimization. The agent determines an ‘action’

〈 p_{i} (t), s_{i} (t) 〉

and receives a ‘reward’ based on system performance.

Figure 1. System architecture for IIoT deployment in a thermal power plant. Sensor nodes (nodes 1, 2, and 3) gather data from the IIoT environment (e.g., power plant machinery). This data, forming the ‘state’

〈 γ_{i} (t), B_{i} (t), V_{i} (t), Q_{i} (t), T_{i} (t) 〉

, is transmitted (via the gateway antenna symbol) to the GM-DDQN agent for decision optimization. The agent determines an ‘action’

〈 p_{i} (t), s_{i} (t) 〉

and receives a ‘reward’ based on system performance.

Figure 2. Architecture of the proposed GM-DDQN. The state

s_{t}

is fed into the lightweight main Q-network to produce Q-values

Q_{i} (s_{t}, a)

. The target network, structurally similar but with delayed parameters, also estimates Q-values. The gradient memory stores recent gradients

G_{1}, \dots, G_{m}

, which are utilized in the update process of the main Q-network parameters, contributing to memory-efficient learning.

Figure 2. Architecture of the proposed GM-DDQN. The state

s_{t}

is fed into the lightweight main Q-network to produce Q-values

Q_{i} (s_{t}, a)

. The target network, structurally similar but with delayed parameters, also estimates Q-values. The gradient memory stores recent gradients

G_{1}, \dots, G_{m}

, which are utilized in the update process of the main Q-network parameters, contributing to memory-efficient learning.

Figure 3. Comprehensive performance evaluation of GM-DDQN and baseline methods. (a) Estimated battery lifetime across three scenarios. (b) Comparison of Packet Success Rate (PSR) and end-to-end latency. (c) Learning curves illustrating the raw reward fluctuations and smoothed convergence trends. (d) Stability analysis of the learning process, where solid lines represent the mean reward over 10 independent runs and shaded areas represent the standard deviation.

Table 1. Memory footprint: standard DDQN with experience replay vs. GM-DDQN.

Memory Component	Std. DDQN (10k Samples)	GM-DDQN ( $m = 8$ )
Experience/Gradient Storage	∼480 KB (for transitions)	∼4.5 KB (for gradients)
Network Parameters (Quantized)	∼21.3 KB (float32)	∼0.56 KB (int8)
Approx. Total On-Device	∼501.3 KB	∼5.06 KB

Table 2. Parameter count comparison: standard DDQN vs. GM-DDQN.

Network Layer	Std. DDQN ( $n_{h} = 64$ )	GM-DDQN ( $n_{h} = 16$ )	Reduction
Input to Hidden 1	$5 \times 64 + 64 = 384$	$5 \times 16 + 16 = 96$	75.0%
Hidden 1 to Hidden 2	$64 \times 64 + 64 =$ 4160	$16 \times 16 + 16 = 272$	93.5%
Hidden 2 to Output	$64 \times 12 + 12 = 780$	$16 \times 12 + 12 = 204$	73.8%
Total Parameters	5324	572	89.3%

Table 3. Simulation parameters used in the experimental evaluation.

Category	Parameter	Value
Environment	Baseline interference ( $I_{b a s e}$ )	−90 dBm
	Interference burst amplitude ( $I_{b u r s t}$ )	−80 to −60 dBm
	Normal vibration level ( $V_{n o r m a l}$ )	20 Hz
	Abnormal vibration level ( $V_{a n o m a l y}$ )	60–80 Hz
	Temperature range ( $T_{b a s e} \pm Δ T$ )	40 °C to 120 °C
Network	Path loss exponent ( $α$ )	3.5
	Path loss at reference distance ( $P L_{0}$ )	−40 dBm
	Temperature coefficient ( $β$ )	0.002
	Noise floor ( $N_{0}$ )	−100 dBm
	Data packet size (L)	128 bytes
IIoT Device	Battery capacity ( $B^{m a x}$ )	1000 mAh
	Tx power levels ( $P$ )	${0, 5, 10, 15}$ dBm
	Sleep mode power ( $e^{s}$ )	${50, 5, 0.1}$ mW
	Wake-up delay ( $d^{s}$ )	${0, 10, 100}$ ms
	Processor frequency	48 MHz
Algorithm (GM-DDQN)	Learning rate ( $α$ )	0.001
	Discount factor ( $γ$ )	0.95
	Target network update rate ( $β$ )	0.05
	Gradient memory size (m)	8

Table 4. Estimated battery lifetime (months) across different scenarios.

Method	Scenario 1	Scenario 2	Scenario 3
Fixed Policy (FP)	8.2	9.1	7.5
Threshold-Based Adaptation (TBA)	10.3	11.7	9.6
Q-FA	11.8	12.5	10.2
GM-DDQN (Proposed)	14.2	15.1	13.6
Standard DDQN	14.8	15.7	14.1
DDPG	15.0	15.9	14.3
PPO	14.7	15.6	14.0

Table 5. Resource utilization comparison for on-device deployment profile.

Method	ROM (KB)	RAM (KB)	Comp. Time (ms)	Energy Overhead (mJ/decision)
Fixed Policy (FP)	0.5	0.1	0.02	0.001
Threshold-Based Adaptation (TBA)	1.2	0.3	0.05	0.002
Q-FA	2.3	1.2	0.8	0.04
GM-DDQN (Proposed)	4.8	2.4	12	0.6
Standard DDQN	42	520	85	4.2
DDPG	78	620	120	6.0
PPO	65	580	95	4.7

Table 6. Ablation study results (averaged across scenarios).

Method Variant	Energy Cons. (mJ/packet)	PSR (%)	Comp. Time (ms)
Full GM-DDQN	0.28	96.5	12
Without Gradient Memory	0.30	95.8	15
Without Quantization	0.27	96.7	38
Single Hidden Layer	0.32	94.2	8
Without Target Network	0.35	92.3	10

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gao, S.; Zou, Y.; Feng, L. A Lightweight Double-Deep Q-Network for Energy Efficiency Optimization of Industrial IoT Devices in Thermal Power Plants. Electronics 2025, 14, 2569. https://doi.org/10.3390/electronics14132569

AMA Style

Gao S, Zou Y, Feng L. A Lightweight Double-Deep Q-Network for Energy Efficiency Optimization of Industrial IoT Devices in Thermal Power Plants. Electronics. 2025; 14(13):2569. https://doi.org/10.3390/electronics14132569

Chicago/Turabian Style

Gao, Shuang, Yuntao Zou, and Li Feng. 2025. "A Lightweight Double-Deep Q-Network for Energy Efficiency Optimization of Industrial IoT Devices in Thermal Power Plants" Electronics 14, no. 13: 2569. https://doi.org/10.3390/electronics14132569

APA Style

Gao, S., Zou, Y., & Feng, L. (2025). A Lightweight Double-Deep Q-Network for Energy Efficiency Optimization of Industrial IoT Devices in Thermal Power Plants. Electronics, 14(13), 2569. https://doi.org/10.3390/electronics14132569

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Lightweight Double-Deep Q-Network for Energy Efficiency Optimization of Industrial IoT Devices in Thermal Power Plants

Abstract

1. Introduction

1.1. Background

1.2. Key Challenges in Energy Efficiency Optimization

1.3. Our Contributions

2. Related Work

2.1. Energy Efficiency Optimization in IIoT

2.2. Reinforcement Learning-Based Energy and Resource Management

2.3. Lightweight DRL Methods and Edge Intelligence

2.4. Research Gaps and Contributions

3. Problem Modeling and Environment Definition

3.1. System Model

3.2. Thermal Power Plant Environment Characteristics

3.2.1. Electromagnetic Interference Patterns

3.2.2. Equipment Vibration and Monitoring Needs

3.2.3. Temperature and Path Loss Variations

3.3. Energy Consumption Model

3.4. MDP Formulation

3.4.1. State Space ( S )

3.4.2. Action Space ( A )

3.4.3. Transition Probability ( P )

3.4.4. Reward Function ( R )

3.4.5. Discount Factor ( γ )

3.5. Optimization Objective

3.6. Problem Complexity Analysis

4. Lightweight Double-Deep Q-Network (GM-DDQN) Method

4.1. GM-DDQN Architecture Overview

4.2. The Core Innovation: The Gradient Memory Mechanism

4.3. Supporting Lightweighting Techniques

4.3.1. Compact Neural Network Architecture and Quantization

4.3.2. Efficient Target Network Updates

4.4. State and Action Representation

4.4.1. State Normalization

4.4.2. Action Discretization

4.5. Training Procedure

4.6. Implementation Considerations for IIoT Devices

4.7. Theoretical Analysis

5. Experiments

5.1. Experimental Setup

5.1.1. Simulation Environment

5.1.2. Test Scenarios

5.1.3. Implementation Details

5.2. Baseline Methods for Comparison

5.3. Evaluation Metrics

5.4. Experimental Results and Analysis

5.4.1. Energy Efficiency Performance

5.4.2. Communication Performance

5.4.3. Resource Utilization

5.4.4. Learning Performance

5.4.5. Performance Analysis in Specific Scenarios

5.4.6. Ablation Study and Component Analysis

6. Discussion

6.1. Summary of Experimental Results and Their Significance

6.2. Research Contributions

6.3. Limitations and Future Work

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.4.1. State Space ( $S$ )

3.4.2. Action Space ( $A$ )

3.4.3. Transition Probability ( $P$ )

3.4.4. Reward Function ( $R$ )

3.4.5. Discount Factor ( $γ$ )