Search Results (160)

Search Parameters:
Keywords = advantage actor–critic

18 pages, 3274 KB  
Article
MEC-Chain: Towards a New Framework for a MEC-Enabled Mobile Blockchain Network Under the PoS Consensus
by Rima Grati, Khouloud Boukadi and Safa Elleuch
Future Internet 2025, 17(12), 563; https://doi.org/10.3390/fi17120563 - 5 Dec 2025
Viewed by 254
Abstract
The Proof of Stake (PoS) consensus mechanism is increasingly used in blockchain systems; however, resource allocation for PoS-based mobile blockchain networks remains underexplored, particularly given the constraints of mobile devices. This work introduces MEC-Chain, a new framework that integrates Mobile Edge Computing (MEC) with mobile blockchain to support efficient validator-node execution under PoS. MEC-Chain formalizes a multi-objective resource-allocation problem that jointly considers latency, reliability, and cost from both the validator and MEC-provider perspectives. To address this challenge, we develop a deep reinforcement learning-based allocation agent using the Proximal Policy Optimization (PPO) algorithm. Experimental results show that PPO achieves a 30–40% reduction in total execution time, 25–35% lower transmission latency, and 10–15% higher reliability compared to A2C (Advantage Actor–Critic) and DQN (Deep Q-Network), while offering comparable cost savings across all methods. These results demonstrate the effectiveness of MEC-Chain in enabling low-latency, reliable, and resource-efficient PoS validation within mobile blockchain environments. Full article
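As a rough illustration of the kind of objective such an allocation agent optimizes, the sketch below combines latency, reliability, and provider cost into a single scalar reward. The weights, field names, and values are assumptions for the example, not the paper's implementation.

```python
# Minimal sketch (not MEC-Chain's code): a weighted multi-objective reward of the
# kind a PPO allocation agent could optimize, trading off validator latency,
# reliability, and MEC-provider cost. All weights and field names are illustrative.
from dataclasses import dataclass

@dataclass
class AllocationOutcome:
    exec_time_s: float      # total execution time of the validation task
    tx_latency_s: float     # transmission latency to the chosen MEC server
    reliability: float      # estimated success probability in [0, 1]
    cost: float             # monetary cost charged by the MEC provider

def reward(o: AllocationOutcome, w=(1.0, 1.0, 2.0, 0.5)) -> float:
    """Higher is better: penalize time, latency, and cost; reward reliability."""
    w_t, w_l, w_r, w_c = w
    return -w_t * o.exec_time_s - w_l * o.tx_latency_s + w_r * o.reliability - w_c * o.cost

# Example: compare two candidate allocations for one validator task.
print(reward(AllocationOutcome(0.8, 0.05, 0.99, 1.2)))
print(reward(AllocationOutcome(0.5, 0.12, 0.95, 2.0)))
```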

23 pages, 2592 KB  
Article
Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction
by Xiaopeng Zhang, Lin Wang, Yipeng Zhang and Zewei Feng
Sustainability 2025, 17(24), 10889; https://doi.org/10.3390/su172410889 - 5 Dec 2025
Viewed by 386
Abstract
The heterogeneity of human driving styles in mixed-traffic environments manifests as divergent decision-making behaviors in complex scenarios like highway merging. By accurately recognizing these driving styles and predicting corresponding trajectories, autonomous vehicles can enhance safety, improve traffic efficiency, and concurrently achieve fuel savings in highway merging scenarios. This paper proposes a novel framework wherein a clustering algorithm first establishes statistical priors of driving styles. These priors are then integrated into a Model Predictive Control (MPC) model that leverages Bayesian inference to generate a probability-aware trajectory prediction. Finally, this predicted trajectory is embedded as a component of the state input to a reinforcement learning agent, which is trained using an Actor–Critic architecture to learn the optimal control policy. Experimental results validate the significant superiority of the proposed framework. Under the most challenging high-density traffic scenarios, our method boosts the evaluation reward by 11.26% and the average speed by 10.08% compared to the baseline Multi-Agent Proximal Policy Optimization (MAPPO) algorithm. This advantage also persists in low-density scenarios, where a steady 10.60% improvement in evaluation reward is achieved. These findings confirm that the proposed integrated approach provides an effective decision-making solution for autonomous vehicles, capable of substantially enhancing interaction safety and traffic efficiency in emerging mixed-traffic environments. Full article
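The sketch below illustrates one plausible way the predicted trajectory could be embedded into the agent's state vector, as the abstract describes; the dimensions, field layout, and padding rule are assumptions, not the authors' code.

```python
# Illustrative only: flatten a predicted trajectory of a human-driven vehicle and
# concatenate it with the ego and neighbor states observed by the actor-critic agent.
import numpy as np

def build_state(ego, neighbors, predicted_traj, horizon=10):
    """ego: (x, y, v); neighbors: list of (x, y, v); predicted_traj: (horizon, 2) array."""
    traj = np.asarray(predicted_traj, dtype=float)
    if traj.shape[0] < horizon:                      # pad short predictions with last point
        pad = np.repeat(traj[-1:], horizon - traj.shape[0], axis=0)
        traj = np.vstack([traj, pad])
    parts = [np.asarray(ego, float)]
    parts += [np.asarray(n, float) for n in neighbors]
    parts.append(traj[:horizon].reshape(-1))         # embed the prediction as features
    return np.concatenate(parts)

state = build_state(ego=(0.0, 0.0, 28.0),
                    neighbors=[(15.0, 3.5, 25.0)],
                    predicted_traj=[(15 + 2.5 * k, 3.5) for k in range(10)])
print(state.shape)   # (3 + 3 + 20,) = (26,)
```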

18 pages, 1954 KB  
Article
Greenhouse Irrigation Control Based on Reinforcement Learning
by Juan Pablo Padilla-Nates, Leonardo D. Garcia, Camilo Lozoya, Luis Orona and Aldo Cortes-Perez
Agronomy 2025, 15(12), 2781; https://doi.org/10.3390/agronomy15122781 - 2 Dec 2025
Viewed by 895
Abstract
Precision irrigation provides a sustainable approach to enhancing water efficiency while maintaining crop productivity. This study evaluates a reinforcement learning approach, using the advantage actor–critic algorithm, for closed-loop irrigation control in a greenhouse environment. The reinforcement learning control is designed to regulate soil moisture near the maximum allowable depletion threshold, minimizing water use without compromising plant health. Its performance is compared against two common strategies: an on–off closed-loop controller and a time-based open-loop controller. The results show that the proposed controller consistently reduces irrigation water consumption relative to both benchmarks, while adapting effectively to environmental variability and the crop’s increasing water demand during growth. These findings highlight the potential of reinforcement learning to achieve a more efficient balance between water conservation and crop health in controlled agricultural systems. Full article
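A minimal sketch of a reward that keeps soil moisture just above a maximum allowable depletion (MAD) threshold while penalizing water use is shown below; the threshold values and weights are illustrative assumptions rather than the study's actual design.

```python
# Toy reward shaping for an advantage actor-critic irrigation controller (assumed values).
def irrigation_reward(moisture, irrigation_liters, mad=0.22, field_capacity=0.35,
                      w_water=0.5, stress_penalty=5.0):
    if moisture < mad:                       # crop stress: strongly penalized
        return -stress_penalty * (mad - moisture) - w_water * irrigation_liters
    # reward staying close to the MAD threshold (little excess water held in the soil)
    excess = (moisture - mad) / (field_capacity - mad)
    return (1.0 - excess) - w_water * irrigation_liters

print(irrigation_reward(moisture=0.24, irrigation_liters=0.0))   # near threshold, no water
print(irrigation_reward(moisture=0.34, irrigation_liters=2.0))   # over-irrigated
print(irrigation_reward(moisture=0.19, irrigation_liters=0.0))   # below MAD: stressed
```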

35 pages, 4264 KB  
Article
Smart Tangency Portfolio: Deep Reinforcement Learning for Dynamic Rebalancing and Risk–Return Trade-Off
by Jiayang Yu and Kuo-Chu Chang
Int. J. Financial Stud. 2025, 13(4), 227; https://doi.org/10.3390/ijfs13040227 - 2 Dec 2025
Viewed by 958
Abstract
This paper proposes a dynamic portfolio allocation framework that integrates deep reinforcement learning (DRL) with classical portfolio optimization to enhance rebalancing strategies and risk–return management. Within a unified reinforcement-learning environment for portfolio reallocation, we train actor–critic agents (Proximal Policy Optimization (PPO) and Advantage Actor–Critic (A2C)). These agents learn to select both the risk-aversion level—positioning the portfolio along the efficient frontier defined by expected return and a chosen risk measure (variance, Semivariance, or CVaR)—and the rebalancing horizon. An ensemble procedure, which selects the most effective agent–utility combination based on the Sharpe ratio, provides additional robustness. Unlike approaches that directly estimate portfolio weights, our framework retains the optimization structure while delegating the choice of risk level and rebalancing interval to the AI agent, thereby improving stability and incorporating a market-timing component. Empirical analysis on daily data for 12 U.S. sector ETFs (2003–2023) and 28 Dow Jones Industrial Average components (2005–2023) demonstrates that DRL-guided strategies consistently outperform static tangency portfolios and market benchmarks in annualized return, volatility, and Sharpe ratio. These findings underscore the potential of DRL-driven rebalancing for adaptive portfolio management. Full article
(This article belongs to the Special Issue Financial Markets: Risk Forecasting, Dynamic Models and Data Analysis)
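The ensemble step described above (selecting the best agent–utility combination by Sharpe ratio) can be summarized with the toy sketch below; the return series and combination names are invented for illustration.

```python
# Hedged sketch of Sharpe-ratio-based ensemble selection over candidate agent/utility pairs.
import numpy as np

def sharpe(daily_returns, rf_daily=0.0, periods=252):
    r = np.asarray(daily_returns, float) - rf_daily
    return np.sqrt(periods) * r.mean() / (r.std(ddof=1) + 1e-12)

rng = np.random.default_rng(0)
candidates = {                                   # made-up validation return series
    "PPO + variance":     rng.normal(0.0006, 0.010, 500),
    "PPO + CVaR":         rng.normal(0.0005, 0.008, 500),
    "A2C + semivariance": rng.normal(0.0004, 0.009, 500),
}
best = max(candidates, key=lambda k: sharpe(candidates[k]))
print({k: round(sharpe(v), 2) for k, v in candidates.items()}, "-> selected:", best)
```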

24 pages, 4899 KB  
Article
Crystallization Process Optimization Using Hybrid Tomographic Imaging and Deep Reinforcement Learning for Sustainable Energy Systems
by Konrad Niderla, Tomasz Rymarczyk, Grzegorz Kłosowski, Monika Kulisz, Grzegorz Bartnik, Paweł Kaleta, Emanuel Józefacki and Dariusz Dudek
Energies 2025, 18(23), 6193; https://doi.org/10.3390/en18236193 - 26 Nov 2025
Viewed by 404
Abstract
Crystallization is a fundamental unit operation in chemical, pharmaceutical, and energy industries, where strict control of crystal size distribution (CSD) is essential for ensuring product quality and process efficiency. However, the nonlinear dynamics of crystallization and the absence of explicit functional relationships between process variables make effective control a significant challenge. This study proposes a hybrid approach that integrates process tomography with deep reinforcement learning (RL) for adaptive crystallization control. A dedicated hybrid tomographic system, combining Electrical Impedance Tomography (EIT) and Ultrasound Tomography (UST), was developed to provide complementary real-time spatial information, while a ResNet neural network enabled accurate image reconstruction. These data were used as input to a reinforcement learning agent operating in a Simulink-based simulation environment, where temperature was selected as the primary controlled variable. To evaluate the applicability of RL in this context, four representative algorithms were implemented and compared: Actor–Critic, Asynchronous Advantage Actor–Critic, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization. The results demonstrate that PPO achieved the most stable and effective performance, yielding improved control of the CSD and control proxies consistent with potential energy savings. The findings confirm that hybrid tomographic imaging combined with RL-based control provides a promising pathway toward sustainable, intelligent crystallization processes with enhanced product quality and energy efficiency. Full article

31 pages, 3746 KB  
Article
An Advantage Actor–Critic-Based Quality of Service-Aware Routing Optimization Mechanism for Optical Satellite Network
by Wei Zhou, Bingli Guo, Xiaodong Liang, Qingsong Luo, Boying Cao, Zongxiang Xie, Ligen Qiu, Xinjie Shen and Bitao Pan
Photonics 2025, 12(12), 1148; https://doi.org/10.3390/photonics12121148 - 22 Nov 2025
Viewed by 282
Abstract
To support the 6G vision of seamless “space–air–ground-integrated” global coverage, optical satellite networks must enable high-speed, low-latency, and intelligent data transmission. However, conventional inter-satellite laser link-based optical transport networks suffer from inefficient bandwidth utilization and nonlinear latency accumulation caused by multi-hop routing, which severely limits their ability to support ultra-low-latency and real-time applications. To address the critical challenges of high topological complexity and stringent real-time requirements in satellite elastic optical networks, we propose an asynchronous advantage actor–critic-based quality of service-aware routing optimization mechanism for the optical inter-satellite link (OISL-AQROM). By establishing a quantitative model that correlates the optical service unit (OSU) C value with node hop count, the algorithm enhances the performance of latency-sensitive services in dynamic satellite environments. Simulation results conducted on a Walker-type low Earth orbit (LEO) constellation comprising 1152 satellites demonstrate that OISL-AQROM reduces end-to-end latency by 37.6% to 76.3% compared to the traditional heuristic multi-constrained shortest path first (MCSPF) algorithm, while supporting fine-grained dynamic bandwidth adjustment down to a minimum granularity of 2.6 Mbps. Furthermore, OISL-AQROM exhibits strong convergence and robust stability across diverse traffic loads, consistently outperforming MCSPF and the deep deterministic policy gradient (DDPG) algorithm in overall efficiency, load adaptability, and operational reliability. The proposed algorithm significantly improves service quality and transmission efficiency in commercial mega-constellation optical satellite networks, demonstrating engineering applicability and potential for practical deployment in future 6G infrastructure. Full article
(This article belongs to the Special Issue Emerging Technologies for 6G Space Optical Communication Networks)
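For intuition, the sketch below gives a simplified reading of two elements mentioned in the abstract: allocating bandwidth in multiples of a minimum OSU granularity (2.6 Mbps) and scoring a candidate path by hop count and accumulated latency. The weighting and rounding rule are assumptions, not the paper's model.

```python
# Hedged sketch: OSU-granular bandwidth allocation and a hop/latency path score.
import math

OSU_GRANULARITY_MBPS = 2.6

def allocate_bandwidth(demand_mbps: float) -> float:
    """Round the demand up to the next OSU granularity step."""
    return math.ceil(demand_mbps / OSU_GRANULARITY_MBPS) * OSU_GRANULARITY_MBPS

def path_reward(hop_latencies_ms, w_hops=0.5, w_latency=0.1):
    hops = len(hop_latencies_ms)
    latency = sum(hop_latencies_ms)
    return -(w_hops * hops + w_latency * latency)   # higher (less negative) is better

print(allocate_bandwidth(10.0))                     # -> 10.4 Mbps (4 OSU slots)
print(path_reward([4.2, 5.1, 3.8]), path_reward([4.2, 5.1, 3.8, 6.0, 4.4]))
```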

25 pages, 1326 KB  
Article
UAV-Mounted Base Station Coverage and Trajectory Optimization Using LSTM-A2C with Attention
by Yonatan M. Worku, Christos Christodoulou and Michael Devetsikiotis
Drones 2025, 9(11), 787; https://doi.org/10.3390/drones9110787 - 12 Nov 2025
Viewed by 691
Abstract
In disaster relief operations, Unmanned Aerial Vehicles (UAVs) equipped with base stations (UAV-BS) are vital for re-establishing communication networks where conventional infrastructure has been compromised. Optimizing their trajectories and coverage to ensure equitable service delivery amidst obstacles, wind effects, and energy limitations remains a formidable challenge. This paper proposes an innovative reinforcement learning framework leveraging a Long Short-Term Memory (LSTM)-based Advantage Actor–Critic (A2C) model enhanced with an attention mechanism. Operating within a grid-based disaster environment, our approach seeks to maximize fair coverage for randomly distributed ground users under tight energy constraints. It incorporates a nine-direction movement model and a fairness-focused communication strategy that prioritizes unserved users, thereby improving both equity and efficiency. The attention mechanism enhances adaptability by directing focus to critical areas, such as clusters of unserved users. Simulation results reveal that our method surpasses baseline reinforcement learning techniques in coverage fairness, Quality of Service (QoS), and energy efficiency, providing a scalable and effective solution for real-time disaster response. Full article
(This article belongs to the Special Issue Space–Air–Ground Integrated Networks for 6G)
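A toy sketch of the nine-direction movement model and a fairness-weighted coverage reward (favoring so-far unserved users) is given below; grid size, coverage radius, energy cost, and weighting are illustrative assumptions.

```python
# Illustrative UAV-BS step: hover plus eight compass moves, fairness-weighted coverage reward.
import numpy as np

ACTIONS = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0), 4: (1, -1),
           5: (0, -1), 6: (-1, -1), 7: (-1, 0), 8: (-1, 1)}   # hover + 8 directions

def step(uav_xy, action, users_xy, served_time, radius=2.0, grid=20):
    uav = np.clip(np.asarray(uav_xy) + ACTIONS[action], 0, grid - 1)
    covered = np.linalg.norm(users_xy - uav, axis=1) <= radius
    # fairness: users with little accumulated service contribute more to the reward
    weights = 1.0 / (1.0 + served_time)
    reward = float((weights * covered).sum()) - 0.05          # small energy cost per move
    served_time = served_time + covered
    return tuple(int(v) for v in uav), reward, served_time

users = np.array([[3.0, 4.0], [10.0, 10.0], [4.0, 3.0]])
pos, r, t = step((3, 3), action=2, users_xy=users, served_time=np.zeros(3))
print(pos, round(r, 2), t)
```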

19 pages, 2339 KB  
Article
Coded Caching Optimization in Dual Time-Scale Wireless Networks: An Advantage Actor–Critic Learning Approach
by Jiajie Ren and Chang Guo
Appl. Sci. 2025, 15(22), 11915; https://doi.org/10.3390/app152211915 - 9 Nov 2025
Viewed by 430
Abstract
The rapid growth of mobile data traffic poses significant challenges to ensuring high-quality service in wireless networks. Although the caching technique is capable of alleviating network congestion, most existing schemes depend on uncoded caching with prior knowledge of content popularity and ignore the time-scale mismatch between content dynamics and user mobility. To address these challenges, we first formulate a dynamic coded caching optimization framework under a dual time-scale model that simultaneously captures long-term content popularity evolution and short-term user mobility patterns. Then, we model the optimization problem as a Markov decision process and design a novel advantage actor–critic (A2C) based coded caching algorithm. By introducing the advantage function, the proposed approach can mitigate variance in policy updates and accelerate convergence under the caching capacity constraint. Finally, extensive simulations are conducted to demonstrate that our proposed algorithm significantly outperforms baseline caching schemes in terms of average delay cost. Full article
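Since the advantage function is central here (and to most entries on this page), the generic sketch below shows how a critic's value baseline turns sampled returns into advantages that scale the actor's update; the numbers are toy values, not the authors' caching code.

```python
# Generic A2C advantage computation with a value baseline (toy numbers).
import numpy as np

def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    g, out = bootstrap, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return np.array(out[::-1])

rewards   = [-1.0, -0.5, -0.2, -0.1]             # e.g. negative delay costs per slot
values    = np.array([-1.6, -0.9, -0.4, -0.15])  # critic estimates V(s_t)
returns   = discounted_returns(rewards, bootstrap=0.0)
advantage = returns - values                     # A(s_t, a_t) = G_t - V(s_t)

log_probs = np.array([-0.7, -0.9, -0.5, -0.6])   # log pi(a_t | s_t) of the taken actions
actor_loss  = -(log_probs * advantage).mean()    # policy gradient with a variance-reducing baseline
critic_loss = 0.5 * ((returns - values) ** 2).mean()
print(advantage.round(3), round(actor_loss, 3), round(critic_loss, 3))
```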

25 pages, 568 KB  
Article
Exploring the Psychological and Social Dynamics of Steroid and Performance-Enhancing Drug (PED) Use Among Late Adolescents and Emerging Adults (16–22): A Thematic Analysis
by Metin Çınaroğlu, Eda Yılmazer and Esra Noyan Ahlatcıoğlu
Adolescents 2025, 5(4), 63; https://doi.org/10.3390/adolescents5040063 - 27 Oct 2025
Viewed by 1724
Abstract
Background: Performance-enhancing drug (PED) use has become increasingly prevalent among adolescents and emerging adults, not solely for athletic advantage but as a psychological and sociocultural coping mechanism. In Türkiye, where Westernized body ideals intersect with traditional values, the emotional and symbolic meanings of PED use among youth remain underexplored. Methods: This qualitative study employed semi-structured interviews and reflexive thematic analysis to examine the subjective experiences of 26 Turkish adolescents and emerging adults (19 males, 7 females; ages 16–22) in Istanbul who reported non-medical use of steroids or other PEDs. Participants were recruited through snowball sampling in gym-adjacent communities across six urban districts. Interviews were conducted online, recorded, transcribed, and analyzed to identify emergent psychological themes. Results: Six interconnected themes were identified: (1) body-based insecurity and the fantasy of reinvention; (2) emotional regulation through bodily control; (3) secrecy as autonomy; (4) compulsive enhancement and dissatisfaction; (5) psychological dependency and regret; and (6) PED use as agency and protest. While male and female participants differed in aesthetic goals and social narratives, both groups framed PED use as a means of identity construction, emotional survival, and social validation. Participants did not perceive themselves as deviant but as strategic actors navigating a performance-driven culture. Conclusions: PED use among youth in urban Türkiye emerges as a psychologically embedded coping mechanism rooted in emotional regulation, self-concept, and perceived control. Rather than a deviant behavior, it reflects an adaptive but precarious strategy for managing insecurity and achieving recognition during a critical developmental stage. Full article

24 pages, 2291 KB  
Article
Achieving Computational Symmetry: A Novel Workflow Task Scheduling and Resource Allocation Method for D2D Cooperation
by Xianzhi Cao, Chang Lv, Jiali Li and Jian Wang
Symmetry 2025, 17(10), 1746; https://doi.org/10.3390/sym17101746 - 16 Oct 2025
Viewed by 546
Abstract
With the rapid advancement of mobile edge computing and Internet of Things (IoT) technologies, device-to-device (D2D) cooperative computing has garnered significant attention due to its low latency and high resource utilization efficiency. However, workflow task scheduling in D2D networks poses considerable challenges, such as severe heterogeneity in device resources and complex inter-task dependencies, which may result in low resource utilization and inefficient scheduling, ultimately breaking the computational symmetry—a balanced state of computational resource allocation among terminal devices and load balance across the network. To address these challenges and restore system-level symmetry, a novel workflow task scheduling method tailored for D2D cooperative environments is proposed. First, a Non-dominated Sorting Genetic Algorithm (NSGA) is employed to optimize the allocation of computational resources across terminal devices, maximizing the overall computing capacity while achieving a symmetrical and balanced resource distribution. A scoring mechanism and a normalization strategy are introduced to accurately assess the compatibility between tasks and processors, thereby enhancing resource utilization during scheduling. Subsequently, task priorities are determined based on the calculation of each task’s Shapley value, ensuring that critical tasks are scheduled preferentially. Finally, a hybrid algorithm integrating Q-learning with Asynchronous Advantage Actor–Critic (A3C) is developed to perform precise and adaptive task scheduling, improving system load balancing and execution efficiency. Extensive simulation results demonstrate that the proposed method outperforms state-of-the-art methods in both energy consumption and response time, with improvements of 26.34% and 29.98%, respectively, underscoring the robustness and superiority of the proposed method. Full article
(This article belongs to the Section Computer)
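The Shapley-value prioritization mentioned above can be illustrated with the small, general-purpose sketch below; the characteristic function is a made-up stand-in, since the abstract does not specify the one used for workflow tasks.

```python
# Exact Shapley values by enumerating orderings (fine for small task sets); toy value function.
from itertools import permutations

def shapley(players, value):
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += value(frozenset(coalition)) - before   # marginal contribution
    return {p: v / len(orderings) for p, v in phi.items()}

def v(coalition):
    # Stand-in characteristic function: base task values plus a synergy between T1 and T3.
    base = {"T1": 3.0, "T2": 1.0, "T3": 2.0}
    bonus = 2.0 if {"T1", "T3"} <= coalition else 0.0
    return sum(base[t] for t in coalition) + bonus

priorities = shapley(["T1", "T2", "T3"], v)
print(sorted(priorities.items(), key=lambda kv: -kv[1]))   # schedule highest first
```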

25 pages, 3060 KB  
Article
Curiosity-Driven Exploration in Reinforcement Learning: An Adaptive Self-Supervised Learning Approach for Playing Action Games
by Sehar Shahzad Farooq, Hameedur Rahman, Samiya Abdul Wahid, Muhammad Alyan Ansari, Saira Abdul Wahid and Hosu Lee
Computers 2025, 14(10), 434; https://doi.org/10.3390/computers14100434 - 13 Oct 2025
Viewed by 1881
Abstract
Games are considered a suitable and standard benchmark for training, evaluating, and comparing the performance of artificial intelligence-based agents. In this research, an application of the Intrinsic Curiosity Module (ICM) and the Asynchronous Advantage Actor–Critic (A3C) algorithm is explored using action games. Although this combination has proven successful in several gaming environments, its effectiveness in action games is rarely explored. This research therefore aims to assess whether integrating ICM with A3C promotes curiosity-driven exploration and adaptive learning in action games. Using the MAME Toolkit library, we interface with the game environments, preprocess game screens to focus on relevant visual elements, and create diverse game episodes for training. The A3C policy is optimized using the Proximal Policy Optimization (PPO) algorithm with tuned hyperparameters. Comparisons are made with baseline methods, including vanilla A3C, ICM with pixel-based predictions, and state-of-the-art exploration techniques. Additionally, we evaluate the agent’s generalization capability in separate environments. The results demonstrate that ICM and A3C effectively promote curiosity-driven exploration in action games, with the agent learning exploration behaviors without relying solely on external rewards. Notably, we also observed improved efficiency and learning speed compared to baseline approaches. This research contributes to curiosity-driven exploration in reinforcement learning-based virtual environments and provides insights into the exploration of complex action games. Successfully applying ICM and A3C in action games presents exciting opportunities for adaptive learning and efficient exploration in challenging real-world environments. Full article
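The intrinsic-reward idea behind ICM can be sketched as follows, under simplifying assumptions (a linear forward model over fixed random features): the agent is rewarded in proportion to its forward-model prediction error, so poorly explored transitions remain interesting even without external game reward.

```python
# Simplified ICM-style intrinsic reward: forward-model prediction error in feature space.
import numpy as np

rng = np.random.default_rng(1)
feat_dim, act_dim = 8, 4
W = rng.normal(size=(feat_dim + act_dim, feat_dim)) * 0.1   # stand-in forward model

def intrinsic_reward(phi_s, action_onehot, phi_s_next, eta=0.5):
    pred_next = np.concatenate([phi_s, action_onehot]) @ W   # predicted next-state features
    error = 0.5 * np.sum((pred_next - phi_s_next) ** 2)      # forward-model error
    return eta * error

phi_s, phi_next = rng.normal(size=feat_dim), rng.normal(size=feat_dim)
a = np.eye(act_dim)[2]
r_total = 0.0 + intrinsic_reward(phi_s, a, phi_next)         # external reward may be zero
print(round(r_total, 3))
```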

26 pages, 2589 KB  
Article
Vision-Based Adaptive Control of Robotic Arm Using MN-MD3+BC
by Xianxia Zhang, Junjie Wu and Chang Zhao
Appl. Sci. 2025, 15(19), 10569; https://doi.org/10.3390/app151910569 - 30 Sep 2025
Viewed by 709
Abstract
To address the reliance of traditional calibrated visual servo systems on precise model calibration, as well as the high training cost and low efficiency of online reinforcement learning, this paper proposes a Multi-Network Mean Delayed Deep Deterministic Policy Gradient Algorithm with Behavior Cloning (MN-MD3+BC) for uncalibrated visual adaptive control of robotic arms. The algorithm improves upon the Twin Delayed Deep Deterministic Policy Gradient (TD3) network framework by adopting an architecture with one actor network and three critic networks, along with corresponding target networks. By constructing a multi-critic network integration mechanism, the mean output of the networks is used as the final Q-value estimate, effectively reducing the estimation bias of a single critic network. Meanwhile, a behavior cloning regularization term is introduced to address the common distribution shift problem in offline reinforcement learning. Furthermore, to obtain a high-quality dataset, an innovative data recombination-driven dataset creation method is proposed, which reduces training costs and avoids the risks of real-world exploration. The trained policy network is embedded into the actual system as an adaptive controller, driving the robotic arm to gradually approach the target position through closed-loop control. The algorithm is applied to uncalibrated multi-degree-of-freedom robotic arm visual servo tasks, providing an adaptive and low-dependency solution for dynamic and complex scenarios. MATLAB simulations and experiments on the WPR1 platform demonstrate that, compared to traditional Jacobian matrix-based model-free methods, the proposed approach exhibits advantages in tracking accuracy, error convergence speed, and system stability. Full article
(This article belongs to the Special Issue Intelligent Control of Robotic System)
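The two mechanisms named above (mean-of-critics Q estimation and behavior-cloning regularization) are easy to state with toy numbers, as in the hedged sketch below; the weighting alpha and all values are assumptions.

```python
# Toy sketch: mean of three critics as the Q estimate, plus a behavior-cloning term
# that pulls the actor's action toward the dataset action (offline RL regularization).
import numpy as np

q1, q2, q3 = 10.2, 9.6, 10.8                     # three critics' estimates for (s, pi(s))
q_mean = np.mean([q1, q2, q3])                   # ensemble mean reduces single-critic bias

actor_action   = np.array([0.30, -0.10, 0.05])   # pi(s) proposed by the policy
dataset_action = np.array([0.25, -0.05, 0.00])   # action recorded in the offline dataset
alpha = 2.5
bc_term = np.sum((actor_action - dataset_action) ** 2)

# Actor objective (to minimize): maximize the mean Q while staying close to the data.
actor_loss = -q_mean + alpha * bc_term
print(round(q_mean, 3), round(bc_term, 4), round(actor_loss, 3))
```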

27 pages, 9914 KB  
Article
Design of Robust Adaptive Nonlinear Backstepping Controller Enhanced by Deep Deterministic Policy Gradient Algorithm for Efficient Power Converter Regulation
by Seyyed Morteza Ghamari, Asma Aziz and Mehrdad Ghahramani
Energies 2025, 18(18), 4941; https://doi.org/10.3390/en18184941 - 17 Sep 2025
Viewed by 670
Abstract
Power converters play an important role in incorporating renewable energy sources into power systems. Among different converter designs, Buck and Boost converters are popular, as they use fewer components and deliver cost savings and high efficiency. However, Boost converters are non-minimum-phase systems, which imposes harder constraints on designing a robust converter. Developing an efficient controller for these topologies can be difficult since they exhibit nonlinearity and distortion in high-frequency modes. The Lyapunov-based Adaptive Backstepping Control (ABSC) technique is used to regulate the outputs of these structures. This approach builds on a Lyapunov stability function to provide increased stability and resistance to fluctuations in real-world conditions. However, in real-time operation, larger disturbances such as supply-voltage changes, parameter variations, and noise may degrade the performance of this strategy. To increase the controller’s flexibility under more demanding operating conditions, the most appropriate initial gains must be established. To address these concerns, the ABSC’s performance is optimized using an adaptive Reinforcement Learning (RL) technique. RL offers several advantages, including lower susceptibility to error, more reliable results derived from data gathered from the environment, accurate model behavior within a given context, and better frequency matching in real-time applications. Random exploration, on the other hand, can have disastrous effects and produce unexpected results in real-world situations. As a result, we choose the Deep Deterministic Policy Gradient (DDPG) approach, which uses a deterministic action function rather than a stochastic one. Its key advantages include effective handling of continuous action spaces, improved sample efficiency through off-policy learning, and faster convergence via its actor–critic architecture that balances value estimation and policy optimization. Furthermore, this technique uses the Grey Wolf Optimization (GWO) algorithm to improve the initial set of gains, resulting in more reliable outcomes and quicker dynamics. The GWO technique is notable for its disciplined, nature-inspired search, which leads to faster decision-making and greater accuracy than other optimization methods. This method treats the system as a black box, without requiring its exact mathematical model, leading to lower complexity and computational burden. The effectiveness of this strategy is tested in both simulation and experimental scenarios using a Hardware-in-the-Loop (HIL) framework, showing strong results and decreased error sensitivity. Full article
(This article belongs to the Special Issue Power Electronics for Smart Grids: Present and Future Perspectives II)
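For readers unfamiliar with GWO, the compact sketch below shows the standard alpha/beta/delta position update applied to a gain-tuning problem; the cost function, bounds, and population settings are stand-ins, not the paper's setup.

```python
# Compact Grey Wolf Optimization sketch: candidate controller gains are "wolves",
# and the alpha/beta/delta leaders pull the pack toward lower cost.
import numpy as np

def cost(gains):                      # stand-in for the closed-loop error of the ABSC controller
    target = np.array([4.0, 0.8, 0.1])
    return float(np.sum((gains - target) ** 2))

def gwo(dim=3, n_wolves=12, iters=60, lo=0.0, hi=10.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(n_wolves, dim))
    for t in range(iters):
        fitness = np.array([cost(x) for x in X])
        alpha, beta, delta = X[np.argsort(fitness)[:3]]
        a = 2.0 - 2.0 * t / iters                      # exploration weight decays to 0
        for i in range(n_wolves):
            new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                new += leader - A * np.abs(C * leader - X[i])
            X[i] = np.clip(new / 3.0, lo, hi)
    best = X[np.argmin([cost(x) for x in X])]
    return best, cost(best)

print(gwo())        # initial gains that could be handed to the RL-tuned controller
```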

28 pages, 2891 KB  
Article
Integrated Operations Scheduling and Resource Allocation at Heavy Haul Railway Port Stations: A Collaborative Dual-Agent Actor–Critic Reinforcement Learning Framework
by Yidi Wu, Shiwei He, Zeyu Long and Haozhou Tang
Systems 2025, 13(9), 762; https://doi.org/10.3390/systems13090762 - 1 Sep 2025
Viewed by 899
Abstract
To enhance the overall operational efficiency of heavy haul railway port stations, which serve as critical hubs in rail–water intermodal transportation systems, this study develops a novel scheduling optimization method that integrates operation plans and resource allocation. By analyzing the operational processes of heavy haul trains and shunting operation modes within a hybrid unloading system, we establish an integrated scheduling optimization model. To solve the model efficiently, a dual-agent advantage actor–critic with Pareto reward shaping (DAA2C-PRS) algorithm framework is proposed, which captures the matching relationship between operations and resources through joint actions taken by the train agent and the shunting agent to depict the scheduling decision process. Convolutional neural networks (CNNs) are employed to extract features from a multi-channel matrix containing real-time scheduling data. Considering the objective function and resource allocation with capacity, we design knowledge-based composite dispatching rules. Regarding the communication among agents, a shared experience replay buffer and Pareto reward shaping mechanism are implemented to enhance the level of strategic collaboration and learning efficiency. Based on this algorithm framework, we conduct experimental verification at H port station, and the results demonstrate that the proposed algorithm exhibits a superior solution quality and convergence performance compared with other methods for all tested instances. Full article
(This article belongs to the Special Issue Scheduling and Optimization in Production and Transportation Systems)
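A purely illustrative sketch of the dual-agent structure with a shared experience buffer is given below; the policies, environment, and reward values are stubs, and the Pareto reward-shaping step is omitted.

```python
# Stub sketch: a train agent and a shunting agent pick a joint action and both
# store their transitions in one shared experience buffer.
import random
from collections import deque

shared_buffer = deque(maxlen=10_000)

def train_policy(state):    return random.choice(["dispatch_train_A", "hold"])
def shunt_policy(state):    return random.choice(["move_cut_to_dumper_1", "idle"])

def env_step(state, joint_action):
    # stub environment: returns the next state and a reward for each agent
    return state + 1, {"train": -1.0, "shunt": -0.5}

state = 0
for t in range(3):
    joint = {"train": train_policy(state), "shunt": shunt_policy(state)}
    next_state, rewards = env_step(state, joint)
    for agent in ("train", "shunt"):
        shared_buffer.append((agent, state, joint[agent], rewards[agent], next_state))
    state = next_state

print(len(shared_buffer), shared_buffer[0])
```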

24 pages, 3537 KB  
Article
Deep Reinforcement Learning Trajectory Tracking Control for a Six-Degree-of-Freedom Electro-Hydraulic Stewart Parallel Mechanism
by Yigang Kong, Yulong Wang, Yueran Wang, Shenghao Zhu, Ruikang Zhang and Liting Wang
Eng 2025, 6(9), 212; https://doi.org/10.3390/eng6090212 - 1 Sep 2025
Viewed by 952
Abstract
In the six-degree-of-freedom (6-DoF) electro-hydraulic Stewart parallel mechanism, strong coupling manifests in that adjusting the elongation of one actuator can induce motion in multiple degrees of freedom of the platform, i.e., a change in pose. This pose change leads to time-varying and unbalanced load forces (disturbance inputs) on the six hydraulic actuators; these unbalanced load forces exacerbate the time-varying nature of the actuators’ acceleration and velocity, causing instantaneous changes in the pressure and flow rate of the electro-hydraulic system and thereby intensifying the pressure–flow nonlinearity of the hydraulic actuators. Considering the advantage of artificial intelligence in learning hidden patterns within complex environments (strong coupling and strong nonlinearity), this paper proposes a reinforcement learning motion control algorithm based on deep deterministic policy gradient (DDPG). Firstly, the static/dynamic coordinate system transformation matrix of the electro-hydraulic Stewart parallel mechanism is established, and the inverse kinematic model and inverse dynamic model are derived. Secondly, a DDPG algorithm framework incorporating an Actor–Critic network structure is constructed; the agent’s state observation space, action space, and a position-error-based reward function are designed, and experience replay and target network mechanisms are employed to optimize the training process. Finally, a simulation model is built on the MATLAB 2024b platform, applying variable-amplitude, variable-frequency sinusoidal input signals to all six degrees of freedom for dynamic characteristic analysis and performance evaluation under the strongly coupled and strongly nonlinear operating conditions of the electro-hydraulic Stewart parallel mechanism; the DDPG agent dynamically adjusts the proportional, integral, and derivative gains of six PID controllers through interactive trial-and-error learning. Simulation results indicate that, compared to the traditional PID control algorithm, the DDPG-PID control algorithm significantly improves the tracking accuracy of all six hydraulic cylinders, with the maximum position error reduced by over 40.00%, achieving high-precision tracking control of variable-amplitude, variable-frequency trajectories in all six degrees of freedom for the electro-hydraulic Stewart parallel mechanism. Full article
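As a minimal illustration of the control idea (not the authors' model), the sketch below maps an agent action to per-cylinder PID gains and runs one PID update; the gain ranges and toy plant values are assumptions.

```python
# Toy DDPG-PID coupling: interpret an action in [0, 1]^(6x3) as (Kp, Ki, Kd) gains
# for each of the six hydraulic cylinders, then run one PID step per cylinder.
import numpy as np

def action_to_gains(action, kp_rng=(0, 500), ki_rng=(0, 50), kd_rng=(0, 20)):
    """Map the agent's normalized action to per-cylinder (Kp, Ki, Kd) gains."""
    a = np.asarray(action).reshape(6, 3)
    lo = np.array([kp_rng[0], ki_rng[0], kd_rng[0]])
    hi = np.array([kp_rng[1], ki_rng[1], kd_rng[1]])
    return lo + a * (hi - lo)

def pid_step(err, prev_err, integ, gains, dt=0.001):
    kp, ki, kd = gains
    integ += err * dt
    u = kp * err + ki * integ + kd * (err - prev_err) / dt
    return u, integ

gains = action_to_gains(np.full(18, 0.5))          # mid-range gains for all six cylinders
u, integ = pid_step(err=0.002, prev_err=0.0015, integ=0.0, gains=gains[0])
print(gains[0], round(u, 4))
```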