Search Results (337)

Search Parameters:
Keywords = markov decision process (MDP)

17 pages, 13011 KB  
Article
An Anti-Swept-Frequency-Jamming Communication Method Based on Proximal Policy Optimization for Nonlinear Scenarios
by Xinrui Xu, Ke Yin, Yingtao Niu and Huacheng Zhu
Electronics 2026, 15(12), 2737; https://doi.org/10.3390/electronics15122737 (registering DOI) - 22 Jun 2026
Abstract
With the advancement in electronic attack technologies, intelligent jamming poses a significant challenge to the reliable transmission of wireless communications. Traditional anti-jamming methods often fail to adapt to dynamic nonlinear jamming environments. This paper addresses nonlinear swept-frequency jamming by modeling anti-jamming communication as a sequential decision-making problem and proposes an intelligent anti-jamming method based on proximal policy optimization (PPO) to optimize dynamic channel selection. Firstly, the channel selection problem is formalized as a Markov decision process (MDP), where a state space integrating jamming patterns and communication status is designed, the channel set is defined as the action space, and a multi-objective reward function trades off jamming avoidance against switching overhead. A dual-network architecture comprising a policy network and a value network is constructed, and the PPO algorithm is employed for policy updates, where a clipping mechanism is used to enhance training stability. The system optimizes the anti-jamming strategy online through a closed-loop process of “sensing–decision–learning–communication”. Simulation results demonstrate that compared to conventional methods, the proposed method significantly improves key performance indicators such as packet success rate and throughput. It can rapidly track changes in jamming, exhibiting excellent real-time performance and environmental robustness, and thus provides an effective solution for reliable communication in dynamic jamming environments. Full article
(This article belongs to the Section Microwave and Wireless Communications)
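The clipping mechanism mentioned in the abstract above can be illustrated with the standard PPO clipped surrogate objective. This is a minimal sketch and not the authors' implementation: the function name `ppo_clip_objective`, the default `clip_eps=0.2`, and the toy inputs are assumptions, since the abstract gives neither the loss nor its hyperparameters.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    ratios:     new-policy / old-policy probability ratios for the sampled actions
    advantages: advantage estimates for the same samples
    clip_eps:   clipping range (hypothetical default, not taken from the paper)
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# A large ratio with a positive advantage is capped at 1 + eps.
print(ppo_clip_objective([1.5], [1.0]))
# A small ratio with a negative advantage is capped at 1 - eps.
print(ppo_clip_objective([0.5], [-1.0]))
```

Taking the minimum of the two terms removes any incentive to move the policy far from the one that collected the data, which is the usual explanation for the improved training stability.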

20 pages, 6237 KB  
Article
Belief-Guided Homeostatic Estimation for Regime Adaptation in Multi-Layer Industrial Network Scheduling
by Wei Xu, Yi Wan and T. Zuo
Algorithms 2026, 19(6), 487; https://doi.org/10.3390/a19060487 - 17 Jun 2026
Viewed by 169
Abstract
Scheduling in multi-layer industrial networks must remain stable even when the feedback mechanism of the environment changes inside a single production episode. The system can switch between a step-continuous regime with dense process feedback and a task-driven regime with sparse milestone feedback, so that the same state requires different behaviour before and after the switch. A regime-oblivious policy may therefore optimise the wrong action preference after a switch. We formulate this setting as a mode-switched multi-industrial-chain Markov decision process (MS-MIC-MDP) and prove that a single fixed action preference is necessarily suboptimal in at least one regime. We then propose BHERA, a belief-guided homeostatic estimation framework for regime adaptation. BHERA builds cross-layer representations, performs structured variational inference of slow and fast latent beliefs, estimates the posterior probability of the task-driven regime, and uses that posterior to regulate sample weights, entropy strength, return-prediction emphasis, and latent information capacity. A homeostatic feedback rule on the Kullback–Leibler (KL) divergence keeps the latent representation informative without allowing uncontrolled information growth, and we analyse it as a two-timescale stochastic approximation with an associated convergence argument and a per-iteration complexity bound. Experiments in a multi-layer industrial scheduling simulator show that BHERA achieves higher return, lower cost, and higher utility than CReSCENT, HiTAC-MuSE, Informed Switching, and WToE across all tested perturbations, with paired statistical tests confirming significance. Expanded ablations and parameter-sensitivity studies confirm the importance of regime belief, regime-balanced weighting, bootstrap prediction, homeostatic capacity control, and the dual-timescale latent split. Full article

20 pages, 2496 KB  
Article
A CNN-PPO Deep Reinforcement Learning Method for Maritime Target Motion Prediction
by Xiao Zheng, Xiaodong Peng, Runnan Qin, Wenming Xie and Lie Qiang
Appl. Sci. 2026, 16(11), 5509; https://doi.org/10.3390/app16115509 - 1 Jun 2026
Viewed by 328
Abstract
In recent years, with the continuous development of the global economy, maritime transportation has become an essential mode of transportation for both domestic and international trade, making it an urgent task for countries around the world to improve the intelligence level of maritime traffic management. As a key technology in real-time vessel monitoring, traffic hazard early warning, and traffic flow estimation, vessel trajectory prediction has been widely applied across both civil and commercial sectors. To address the problems with existing trajectory prediction methods, such as difficulty in establishing real-time vessel kinematic models, limited understanding of vessel navigation strategies, and poor adaptability for long-term prediction, this paper proposes a deep reinforcement learning-based vessel trajectory prediction method. Firstly, the trajectory prediction problem is formulated as a Markov Decision Process (MDP), thereby transforming the prediction problem into an optimal policy-solving problem of the MDP. Secondly, given the complexity of the trajectory prediction problem, a convolutional neural network (CNN) is adopted to parameterize the policy network, and a CNN-PPO deep reinforcement learning method is used to learn the target vessel’s navigation strategy from historical trajectories. Comparative experimental results show that the proposed algorithm is not only suitable for predicting the target’s position at a specific future moment but also offers clear advantages in trajectory prediction across multiple future moments. Full article
(This article belongs to the Special Issue AI Applications in the Maritime Sector)

26 pages, 1505 KB  
Article
TADS-DQN: A Trigger-Based Adaptive Deception Strategy Evolution Method Using Deep Q-Networks
by Zhihao Zhao, Xiran Wang, Leyi Shi and Juan Wang
Modelling 2026, 7(3), 110; https://doi.org/10.3390/modelling7030110 - 1 Jun 2026
Viewed by 256
Abstract
As an active defense paradigm, cyber deception technology effectively misleads attackers by constructing deceptive network environments, thereby increasing the cost of attack operations and introducing uncertainty into their decision-making, while providing defenders with critical response time. However, existing deception strategies are mostly based on predefined static rules derived from expert knowledge and lack the ability to adapt to dynamic attack scenarios autonomously and intelligently. This limitation results in poor adaptability and suboptimal performance of the strategy. To solve these issues, this paper proposes an Adaptive Cyber Deception Defense System (ACDDS). Different from off-the-shelf MDP/DQN frameworks in existing adaptive defense, the core innovation of ACDDS is a scenario-customized Trigger-based Adaptive Deception Strategy evolution method using Deep Q-Networks (TADS-DQN). We specifically formulate the dynamic deception strategy optimization as a cyber-deception-tailored Markov Decision Process (MDP). In this model, the state of the system is represented as a state matrix, and the attack behavior defines the environment for agent interaction. The TADS-DQN method employs a trigger-based mechanism: when a threat to real services is detected, a Deep Q-Network agent is activated. This agent takes the current system state as input and outputs the optimal reconfiguration action. The simulation results indicate that, compared to the baseline methods, TADS-DQN provides more stable defense performance, as evidenced by a smaller fluctuation range and a lower standard deviation of the attack success rate. At the same time, it achieves a reduction in the hit rate against real services that is competitive with the baseline methods. Full article

30 pages, 6485 KB  
Article
A Multi-Agent Emergency Material Allocation Approach Based on a Markov Decision Process Under Demand Uncertainty for Sustainable Disaster Response
by Lu Huang and Jundong Hou
Sustainability 2026, 18(11), 5539; https://doi.org/10.3390/su18115539 - 1 Jun 2026
Viewed by 171
Abstract
Effective emergency relief allocation in dynamic post-disaster environments depends critically on accurate and timely demand information. From a sustainability perspective, improving allocation accuracy is essential for using scarce rescue resources efficiently and supporting resilient disaster response. However, existing demand forecasting approaches frequently exhibit systematic bias, leading to resource misallocation and diminished rescue outcomes. Although deploying on-site assessment teams can partially mitigate this limitation, a unified framework that systematically embeds field assessment feedback into operational allocation processes remains lacking. To bridge this gap, this study proposes a multi-agent joint assessment-allocation model that facilitates coordinated operations between demand assessment and resource distribution activities. The sequential decision-making process is formulated as a Markov Decision Process (MDP), and deep reinforcement learning is employed to coordinate the actions of assessment and allocation teams, enabling allocation policies to be continuously refined through real-time field feedback. By improving the match between actual demand and material supply, the proposed model aims to support more resource-efficient disaster response under demand uncertainty. An empirical case study based on the 2025 Dingri County earthquake in Tibet is conducted to validate the proposed framework. Results demonstrate that integrating assessment feedback substantially improves resource allocation performance: in multi-site rescue scenarios, the framework increases the number of rescued individuals, reduces mission completion time, and enhances overall demand satisfaction. Further sensitivity analysis reveals that a moderate increase in team size strengthens cross-site coordination, whereas excessive team deployment yields diminishing returns and may generate operational redundancy. 
These findings suggest that sustainable emergency management depends not only on the availability of relief resources, but also on the efficient coordination of real-time information acquisition and material allocation. The proposed framework offers a generalizable approach for integrating real-time information acquisition with dynamic relief allocation. It improves the efficient utilization of scarce rescue resources, reduces avoidable operational redundancy, and strengthens the resilience of emergency response systems, thereby contributing to sustainable disaster risk reduction.

43 pages, 1371 KB  
Article
Optimization of Control for a Hybrid Renewable Energy System with Energy Storage Using Deep Reinforcement Learning Methods
by Žydrūnas Kavaliauskas, Mindaugas Milieška, Giedrius Blažiūnas, Giedrius Gecevičius and Hassan Zhairabany
Sustainability 2026, 18(11), 5443; https://doi.org/10.3390/su18115443 - 28 May 2026
Viewed by 582
Abstract
This paper presents a forecasting and optimization framework for the control of a hybrid renewable energy system (HRES) integrating solar, wind, and biomass generation with lithium-ion batteries, electrolyzers, and fuel cells. A bidirectional long short-term memory (bi-LSTM) neural network model was applied for renewable generation and load forecasting, while the deep Q-network (DQN) and soft actor–critic (SAC) algorithms were used for real-time supervisory control of energy storage and hydrogen-based components. The HRES was formulated as a Markov decision process (MDP), where the agents optimize battery charging/discharging, electrolyzer activation, and fuel cell operation under dynamically changing operating conditions. Experimental results demonstrated that the SAC agent achieved more stable learning dynamics and superior operational performance compared to the DQN agent, maintaining an HRES energy imbalance below 0.5 MWh while reducing unnecessary component switching and improving overall system stability. The obtained results confirm the potential of deep reinforcement learning for adaptive and low-emission supervisory control of complex hybrid renewable energy systems. Full article

22 pages, 1547 KB  
Article
Joint Beam Switching and Beam Design for RIS-Assisted Multi-Base Station IoV
by Jinxiang Lai, Deqing Wang and Yifeng Zhao
Appl. Sci. 2026, 16(11), 5399; https://doi.org/10.3390/app16115399 - 28 May 2026
Viewed by 150
Abstract
With the wide application of artificial intelligence (AI) in the Internet of Vehicles (IoV), IoV is under pressure for data transmission and real-time sensing. Integrated sensing and communication (ISAC) is one of the key technologies to alleviate that pressure. Obstacles can cause communication disruptions and increased delays, hindering autonomous driving information acquisition and causing traffic hazards. The application of Reconfigurable Intelligent Surfaces (RISs) aims to solve this problem. This study focuses on RIS-assisted multi-base station (MBS) scenarios in the presence of obstacles. This study aims to maximize the communication rate, minimize the sensing error, and reduce the switching frequency by optimizing the RIS phase shift and beamforming. The problem is modeled as mixed integer nonlinear programming (MINLP) and further described as a Markov Decision Process (MDP). We use Long Short-Term Memory (LSTM) to predict the environmental state and propose two optimization algorithms, Multi-Factor Decision Deep Deterministic Policy Gradient (MFD-DDPG) and Mixed Discrete and Continuous Action DDPG (MDCA-DDPG). In the first algorithm, we consider multiple factors to make a switching decision and use DDPG to yield the optimal action. The second algorithm improves DDPG by outputting a discrete switching decision and a continuous optimized action simultaneously. Simulations show that the proposed algorithms significantly improve the system performance, and the communication rate is increased by more than 40% in specific multi-vehicle scenarios compared to the benchmark. Full article
(This article belongs to the Section Electrical, Electronics and Communications Engineering)

21 pages, 1732 KB  
Article
Resource-Aware Deep Reinforcement Learning for Joint Caching and Service Placement in Multi-Access Edge Computing
by Elias Dritsas and Maria Trigka
Electronics 2026, 15(10), 2074; https://doi.org/10.3390/electronics15102074 - 13 May 2026
Viewed by 357
Abstract
Multi-access edge computing (MEC) enables low-latency service provisioning by placing computation closer to mobile users. However, efficient service placement remains challenging due to dynamic user mobility, limited edge resources, and the need to manage service migration as system conditions evolve. This study proposes a resource-aware, cache-enabled service placement framework based on deep reinforcement learning (DRL) to dynamically select edge nodes for hosting services. The approach jointly considers user location, resource availability, and cache status within a unified decision framework, enabling efficient and adaptive service placement in dynamic MEC environments. The problem is formulated as a Markov decision process (MDP) and solved using deep Q-network (DQN)-based methods, with a reward function that balances latency, resource utilization, and cache efficiency. The proposed framework is evaluated in a simulated MEC environment with mobile users and multiple edge nodes. Experimental results demonstrate that the approach achieves lower latency, improved resource utilization, and enhanced cache efficiency compared to baseline strategies. Among the evaluated models, the dueling double deep Q-network (DDDQN) achieves the most balanced overall performance. The proposed framework provides an adaptive and scalable solution for service management in dynamic MEC environments. Full article
(This article belongs to the Special Issue Machine Learning Approach for Prediction: Cross-Domain Applications)
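The abstract above names the dueling double deep Q-network (DDDQN) as the best-balanced model. The two ideas behind that name can be sketched in plain Python using their standard textbook formulations. This is not the authors' code: the function names, the discount factor, and the toy numbers are hypothetical, and the neural networks are replaced by plain lists of Q-values.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


def double_dqn_target(reward, gamma, online_q_next, target_q_next, done=False):
    """Double-DQN target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return reward + gamma * target_q_next[best_action]


# Three candidate edge nodes (actions) with hypothetical advantages.
print(dueling_q_values(1.0, [1.0, 2.0, 3.0]))
# The online network prefers action 1; the target network scores it at 4.0.
print(double_dqn_target(1.0, 0.5, [0.2, 0.9], [3.0, 4.0]))
```

Separating action selection from action evaluation is the usual remedy for the overestimation bias of a single-network maximum, while the dueling split lets the network learn state values independently of the action advantages.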

25 pages, 12577 KB  
Article
A Hybrid Deep Learning Framework with Q-Table Optimization for Well Log Reconstruction
by Hangju Yu and Bin Zhao
Processes 2026, 14(10), 1548; https://doi.org/10.3390/pr14101548 - 11 May 2026
Viewed by 262
Abstract
The reconstruction of acoustic (AC) logging curves is of great significance for reservoir evaluation, lithology identification, and velocity modeling, particularly in the presence of missing or degraded logging data. However, conventional reconstruction methods and existing deep learning models often suffer from limited feature representation capability and rely heavily on manual hyperparameter tuning, leading to suboptimal performance. To address these challenges, this study proposes a reinforcement learning-based optimization framework for AC logging curve reconstruction. Specifically, a hybrid deep learning architecture integrating convolutional neural networks (CNNs), bidirectional long short-term memory (BiLSTM), and an attention mechanism is developed to effectively capture local spatial features, long-range temporal dependencies, and key feature contributions from multi-logging data. Furthermore, a Q-learning-based optimization strategy is introduced to adaptively tune model hyperparameters by formulating the optimization process as a Markov Decision Process (MDP), enabling dynamic and data-driven parameter adjustment. To validate the effectiveness of the proposed method, comparative experiments are conducted using several baseline and optimized models, including CNN–BiLSTM, CNN–BiLSTM–Attention, particle swarm optimization (PSO)-optimized CNN–BiLSTM–Attention, and genetic algorithm (GA)-optimized CNN–BiLSTM–Attention. The results demonstrate that the proposed approach achieves superior reconstruction accuracy for AC curves, with improved convergence efficiency and model stability. In addition, it exhibits stronger robustness and generalization capability under limited data conditions, effectively mitigating the risk of overfitting and local optima. This study provides a novel reinforcement learning-driven solution for AC logging curve reconstruction and offers practical value for intelligent reservoir characterization in complex geological environments. 
(This article belongs to the Section Petroleum and Low-Carbon Energy Process Engineering)

36 pages, 9578 KB  
Article
Electric Vehicle Charging and Discharging Scheduling Method Based on Clustering and Deep Reinforcement Learning
by Chunqi He and Jiang Li
Energies 2026, 19(9), 2238; https://doi.org/10.3390/en19092238 - 6 May 2026
Viewed by 372
Abstract
With the large-scale integration of electric vehicles (EVs) into the power grid, uncoordinated charging behavior has aggravated load fluctuations in the power system. Deep reinforcement learning can optimize EV charging and discharging strategies through dynamic decision-making, thereby alleviating the operational pressure imposed on the grid by load variations. However, under large-scale EV integration scenarios, challenges still remain, including the excessively high dimensionality of the state space and the resulting decline in training efficiency. In addition, the coupling between existing clustering methods and dynamic scheduling mechanisms is still insufficiently tight. To address these issues, this study proposes a cluster-based deep reinforcement learning method for EV charging and discharging scheduling, referred to as CDRL. First, a probabilistic behavioral model is constructed based on EV charging transaction data to characterize the stochasticity of user charging behavior. A Density–Centroid Hybrid Clustering (DCHC) method is then adopted to cluster the charging behavior characteristics of EVs. Subsequently, at the cluster level, a day-ahead base load forecasting model is introduced, and the forecasting results are fed into a mixed-integer linear programming (MILP) model to generate the charging and discharging power allocation tasks for each cluster. At the individual level, the EV charging and discharging process is formulated as a Markov decision process (MDP), and a deep Q-network (DQN) is employed for policy learning, thereby achieving the decomposition of cluster-level tasks into individual scheduling decisions. The simulation results demonstrate that the proposed method can effectively reduce charging costs and smooth system load fluctuations while improving training convergence speed and policy stability. Full article
(This article belongs to the Section E: Electric Vehicles)

24 pages, 1245 KB  
Article
Bio-Inspired Energy-Efficient Routing for Wireless Sensor Networks Based on Honeybee Foraging Behavior and MDP-Driven Adaptive Scheduling
by Fangyan Chen, Xiangcheng Wu, Weimin Qi, Zhiming Wang, Zhiyu Wang and Peng Li
Biomimetics 2026, 11(5), 311; https://doi.org/10.3390/biomimetics11050311 - 1 May 2026
Viewed by 610
Abstract
Wireless Sensor Networks (WSNs) enable energy-efficient data collection in dynamic environments but continue to face the dual challenges of severely constrained node energy and the spatiotemporal heterogeneity of data traffic. Inspired by honeybee foraging behavior, this paper proposes a hybrid optimization framework that integrates mixed-integer linear programming (MILP) and Markov decision processes (MDP), utilizing Q-learning for adaptive decision-making. The proposed framework systematically maps the dual-layer decision-making mechanism of honeybee foraging onto a synergistic architecture combining MILP-based global planning and MDP-based local adaptation, offering a novel bio-inspired solution for mobile sink trajectory planning and adaptive routing. Specifically, the upper-level MILP module simulates a colony-level global assessment of distant nectar sources, generating an initial global trajectory by determining the optimal access sequence of cluster heads to minimize the movement cost of the mobile sink. The lower-level Q-learning module simulates the individual-level local adaptation, where bees adjust harvesting behavior in real-time based on nectar quality and distance. This module continuously optimizes routing parameters based on real-time network states, including residual energy, the ratio of surviving nodes, data queue lengths, and cluster head density. The algorithm employs an ϵ-greedy strategy to balance exploration and exploitation, while a periodic decision-update mechanism is introduced to harmonize computational efficiency with learning stability. Furthermore, a multi-objective reward function is designed to jointly optimize energy efficiency, network lifetime, end-to-end latency, and path length. Extensive simulation results demonstrate that the proposed MILP-MDP hybrid framework significantly outperforms several representative baseline algorithms in terms of network lifetime extension and energy balance. 
These findings validate that the integration of bio-inspired foraging strategies and reinforcement learning provides an efficient and robust solution for trajectory planning and adaptive routing in dynamic WSNs.
(This article belongs to the Special Issue Bionics in Engineering Practice: Innovations and Applications)
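The lower-level module in the abstract above pairs Q-learning with an ϵ-greedy strategy. A minimal tabular sketch of those two pieces follows. It assumes a small discrete state and action space stored as a list of lists; the learning rate `alpha`, the discount `gamma`, and the function names are hypothetical, and the paper's actual state encoding (residual energy, surviving-node ratio, queue lengths, cluster head density) and reward function are not reproduced here.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
    td_target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


rng = random.Random(0)
q_table = [[0.0, 0.0], [1.0, 2.0]]  # 2 states x 2 actions, toy values
action = epsilon_greedy(q_table[1], epsilon=0.0, rng=rng)  # greedy choice
print(action, q_update(q_table, 0, 1, reward=1.0, next_state=1))
```

With `epsilon=0.0` the choice is always greedy, and a schedule that decays `epsilon` over time is a common way to shift from exploration to exploitation.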

22 pages, 4808 KB  
Article
Transforming Opportunistic Routing: A Deep Reinforcement Learning Framework for Reliable and Energy-Efficient Communication in Mobile Cognitive Radio Sensor Networks
by Suleiman Zubair, Bala Alhaji Salihu, Altyeb Altaher Taha, Yakubu Suleiman Baguda, Ahmed Hamza Osman and Asif Hassan Syed
IoT 2026, 7(2), 34; https://doi.org/10.3390/iot7020034 - 21 Apr 2026
Viewed by 607
Abstract
The Mobile Reliable Opportunistic Routing (MROR) protocol improves data-forwarding reliability in Cognitive Radio Sensor Networks (CRSNs) through mobility-aware virtual contention groups and handover zoning. However, its heuristic decision logic is difficult to optimize under highly dynamic spectrum access and random node mobility. To address this limitation, we present DRL-MROR, a refined routing framework that incorporates deep reinforcement learning (DRL) to enable intelligent and adaptive forwarding decisions. In DRL-MROR, the secondary users (SUs) act as autonomous agents that observe local state information, including primary-user activity, link quality, residual energy, and neighbor-mobility patterns. Each agent learns a forwarding policy through a Deep Q-Network (DQN) optimized for long-term network utility in terms of throughput, delay, and energy efficiency. We formulate routing as a Markov Decision Process (MDP) and use experience replay with prioritized sampling to improve learning stability and convergence. The DQN used at each node is intentionally lightweight, requiring 5514 trainable parameters, about 21.5 kB of weight storage in 32-bit precision, and approximately 5.4k multiply-accumulate operations per inference, which supports practical deployment on edge-capable CRSN nodes. Extensive simulations show that DRL-MROR outperforms the original MROR protocol and representative AI-based routing baselines such as AIRoute under diverse operating conditions. The results indicate gains of up to 38% in throughput, 42% in goodput, a 29% reduction in energy consumed per packet, and an approximately 18% improvement in network lifetime, while maintaining high route stability and fairness. DRL-MROR also reduces control overhead by about 30% and average end-to-end delay by up to 32%, maintaining strong performance even under elevated PU activity and higher node mobility. 
These results show that augmenting opportunistic routing with lightweight DRL can substantially improve adaptability and efficiency in next-generation IoT-oriented CRSNs.
(This article belongs to the Special Issue Advances in Wireless Communication Technologies for IoT Devices)
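The storage figure quoted in the abstract above can be checked with simple arithmetic: 5514 parameters at 4 bytes each (32-bit precision) give 22,056 bytes. The helper below is a hypothetical illustration, not part of the paper. It assumes the abstract's "kB" means 1024-byte units, since that is the reading that reproduces the stated figure of about 21.5.

```python
def weight_storage(num_params, bytes_per_param=4):
    """Return (bytes, KiB, kB) needed to store the weights."""
    total_bytes = num_params * bytes_per_param
    return total_bytes, total_bytes / 1024, total_bytes / 1000


total_bytes, kib, kb = weight_storage(5514)
print(total_bytes)    # 22056
print(round(kib, 2))  # 21.54 with 1024-byte units, matching the abstract
print(round(kb, 2))   # 22.06 with 1000-byte units
```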

25 pages, 9434 KB  
Article
Adaptive Bit Selection via Deep Reinforcement Learning for Large-Scale Image Hashing
by Mitra Rezaei, Mohammed Ayoub Alaoui Mhamdi and Madjid Allili
Electronics 2026, 15(8), 1735; https://doi.org/10.3390/electronics15081735 - 20 Apr 2026
Cited by 1 | Viewed by 410
Abstract
Image hashing enables efficient large-scale image retrieval by encoding high-dimensional visual data into compact binary representations. However, existing deep hashing methods typically learn fixed-length hash codes in a fully supervised manner, often generating redundant bits that limit discriminative capability and increase storage overhead. In this paper, we propose a deep reinforcement learning-based adaptive bit selection framework for compact image hashing. We formulate hash refinement as a Markov Decision Process (MDP) and employ a Proximal Policy Optimization (PPO) agent to selectively retain the most informative hash bits while discarding redundant ones, directly optimizing retrieval performance through mean Average Precision (mAP). The proposed approach integrates CNN-based hash extraction with reinforcement-driven adaptive regeneration, producing compact yet highly discriminative binary codes. Extensive experiments on standard image retrieval benchmarks demonstrate consistent improvements over state-of-the-art deep hashing methods in terms of retrieval accuracy and efficiency, highlighting the effectiveness of reinforcement learning for adaptive representation learning in intelligent large-scale retrieval systems. Full article

30 pages, 2640 KB  
Article
Environment-Aware Optimal Placement and Dynamic Reconfiguration of Underwater Robotic Sonar Networks Using Deep Reinforcement Learning
by Qiming Sang, Yu Tian, Jin Zhang, Yuyang Xiao, Zhiduo Tan, Jiancheng Yu and Fumin Zhang
J. Mar. Sci. Eng. 2026, 14(8), 733; https://doi.org/10.3390/jmse14080733 - 15 Apr 2026
Viewed by 484
Abstract
Underwater dynamic target detection, classification, localization, and tracking (DCLT) is central to maritime surveillance and monitoring and increasingly relies on distributed AUV-based robotic sonar networks operating in passive listening and, when required, cooperative multistatic modes. Achieving a robust performance in realistic oceans remains challenging, because sensor placement must adapt to time-varying acoustic conditions and target priors while preserving acoustic communication connectivity, and because frequent reconfiguration under dynamic currents makes classical large-scale planning computationally expensive. This paper presents an integrated deep reinforcement learning (DRL)-based framework for passive-stage sonar placement and dynamic reconfiguration in distributed AUV networks. First, we cast placement as a constructive finite-horizon Markov decision process (MDP) and train a Proximal Policy Optimization (PPO) agent to sequentially build a collision-free layout on a discretized surveillance grid. The terminal reward is formulated to jointly optimize the environment-aware detection performance, computed from BELLHOP-based transmission loss models, and global network connectivity, quantified using algebraic connectivity. Second, to enable time-critical reconfiguration, we estimate flow-aware motion costs for all AUV–destination pairs using a PPO with a Long Short-Term Memory (LSTM) trajectory policy trained for partial observability. The learned policy can be deployed onboard, allowing each AUV to refine its path online using locally sensed currents, improving robustness to ocean-model uncertainty. The resulting cost matrix is solved via an efficient zero-element assignment method to obtain the optimal one-to-one reassignment. 
In the reported simulation studies, the proposed Sequential PPO placement method achieves a final reward 16–21% higher than Particle Swarm Optimization (PSO) and 2–3.7% higher than the Genetic Algorithm (GA), while the proposed PPO + LSTM planner reduces average travel time by 30.44% compared with A*. The proposed closed-loop architecture supports frequent re-optimization, scalable fleet operation, and a seamless transition to communication-supported cooperative multistatic tracking after detection, enabling efficient, adaptive DCLT in dynamic marine environments. Full article
(This article belongs to the Section Ocean Engineering)
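The one-to-one reassignment step mentioned in this abstract amounts to a linear assignment problem over the AUV-destination cost matrix. The sketch below is a brute-force illustration for very small fleets, not the zero-element assignment method the authors use; the cost values and the function name `best_assignment` are hypothetical.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (assignment, total) minimizing total cost.

    cost[i][j] is the motion cost for AUV i to reach destination j.
    assignment[i] is the destination index given to AUV i.
    Brute force over all permutations: only practical for a handful of AUVs.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Hypothetical 3x3 cost matrix (e.g. estimated travel times).
cost = [
    [9.0, 2.0, 7.0],
    [6.0, 4.0, 3.0],
    [5.0, 8.0, 1.0],
]

assignment, total = best_assignment(cost)
```

For realistic fleet sizes a polynomial-time solver would be needed, since the number of permutations grows factorially with the number of AUVs.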

22 pages, 5390 KB  
Article
Joint Optimization of Time Slot and Power Allocation in Underwater Acoustic Communication Networks
by Xuan Geng and Yongkang Hu
Sensors 2026, 26(7), 2188; https://doi.org/10.3390/s26072188 - 1 Apr 2026
Viewed by 568
Abstract
This paper proposes a joint optimization algorithm based on reinforcement learning to address the time slot and power allocation problem in underwater acoustic communication networks (UACNs). With the total capacity of successful transmissions as the optimization objective, two sub-objectives are formulated, corresponding to time-slot scheduling and power allocation. The time-slot scheduling sub-objective is addressed by constructing a Markov Decision Process (MDP) model solved with Deep Q-Network (DQN) learning. In this model, the agent learns the time slot allocation policy with the goal of increasing the number of successfully transmitted links while reducing collisions. For the power allocation sub-objective, another MDP model is developed and solved by the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, in which each underwater transmission node acts as an independent agent. The MADDPG approach enables the system to improve channel capacity under energy constraints, which maximizes the total capacity of successfully transmitted links. In terms of model execution, the DQN adopts centralized training and time slot allocation, while MADDPG uses centralized training and distributed execution, with each node selecting its own transmission power. Simulation results show that the proposed joint optimization algorithm outperforms TDMA, Slotted ALOHA, and other algorithms in the number of successfully transmitted links and in channel capacity. Full article
(This article belongs to the Special Issue Sensor Networks and Communication with AI)