Search Results (4,931)

Search Parameters:
Keywords = reward

22 pages, 3137 KB  
Article
Fault-Tolerant Attitude Control of Flexible Spacecraft via Reinforcement Learning
by Zhuoyue Peng and Qiang Shen
Aerospace 2026, 13(7), 571; https://doi.org/10.3390/aerospace13070571 (registering DOI) - 24 Jun 2026
Abstract
This paper proposes an integrated attitude control framework for flexible spacecraft subject to external disturbances, rigid–flexible dynamic coupling, and actuator faults. The control framework combines the Twin Delayed Deep Deterministic Policy Gradient (TD3) reinforcement learning algorithm with an adaptive fault-tolerant (AFT) compensator. First, a rigid–flexible coupling dynamic model is formulated using Modified Rodrigues Parameters. Second, an observer-based TD3 attitude controller is designed, where a hierarchical reward function incorporating the observer-estimated flexible modal displacement η̂ is constructed to train the agent for simultaneous attitude convergence and vibration suppression. Third, a composite fault-tolerant control structure is developed by integrating the trained TD3 policy with an adaptive sliding mode compensator that handles both partial loss-of-effectiveness faults and time-varying additive faults. The proposed framework is evaluated under a progressive five-scenario uncertainty protocol encompassing measurement noise, parameter mismatch, external disturbances, and actuator faults. Simulation results demonstrate that (i) the η̂-augmented reward enables substantial improvements in vibration suppression over the baseline reward, achieving a better balance between pointing accuracy and vibration attenuation; and (ii) under the most demanding fault scenario, the AFT compensator proves essential for precise convergence, and the composite TD3+AFT architecture achieves the best overall performance among the four compared control schemes. Full article
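The idea of augmenting a reward with the observer-estimated modal displacement can be illustrated with a minimal sketch. The function name, the weights, and the quadratic form below are assumptions made for illustration only; the abstract does not give the paper's actual hierarchical reward.

```python
import numpy as np

def shaped_reward(sigma_err, eta_hat, w_att=1.0, w_vib=0.1):
    """Hypothetical reward: penalize attitude error and estimated vibration.

    sigma_err: attitude error in Modified Rodrigues Parameters, shape (3,)
    eta_hat:   observer-estimated flexible modal displacement, shape (n_modes,)
    w_att, w_vib: illustrative weights (assumed, not taken from the paper)
    """
    att_cost = float(np.dot(sigma_err, sigma_err))  # squared attitude error
    vib_cost = float(np.dot(eta_hat, eta_hat))      # squared modal displacement
    return -(w_att * att_cost + w_vib * vib_cost)
```

Setting `w_vib=0` would correspond to a baseline reward that ignores vibration, which is one plausible way to read the baseline comparison the abstract mentions.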
36 pages, 11501 KB  
Article
A High- and Low-Level Decoupled Reinforcement Learning Method for Multi-UAV Cooperative Search
by Jianjie Qiu, Yichao Cai, Hao Li, Lei Ni, Kai Yuan and Siyuan Cui
Drones 2026, 10(7), 483; https://doi.org/10.3390/drones10070483 (registering DOI) - 24 Jun 2026
Abstract
Multi-UAV cooperative search with static unknown targets requires both efficient regional allocation and responsive local maneuvering. However, single-level learning methods often suffer from redundant coverage, unclear division of labor, and unstable training. This paper proposes a high- and low-level decoupled reinforcement learning method for multi-UAV cooperative search. The high level periodically generates UAV-specific regional goals from visitation maps, target-existence belief maps, and UAV positions, while a spatial self-attention module enhances the representation of unvisited regions, high-belief target areas, and UAV distributions. The low level performs discrete steering actions based on local observations and high-level contexts, supported by a structured reward that encourages coverage, target discovery, goal-oriented progress, repeated-visit suppression, and boundary-safe motion. Simulation experiments are conducted in a two-dimensional grid environment with static targets and ideal sensing. Under this simplified simulation setting, the proposed method achieves higher training return and coverage rate than representative baseline algorithms while maintaining a high final target discovery rate and reaching the discovery threshold earlier. Ablation and visualization results further demonstrate the effectiveness and interpretability of the proposed hierarchical guidance mechanism within the considered simulation scenario. Full article
29 pages, 1685 KB  
Article
Robust Curriculum-Based SAC for End-to-End Motion Control of a 7-DOF Manipulator Under Sparse Rewards
by Yuhan Zhang and Jijun Gu
Electronics 2026, 15(13), 2784; https://doi.org/10.3390/electronics15132784 (registering DOI) - 24 Jun 2026
Abstract
End-to-end motion control of 7-degree-of-freedom (DOF) redundant manipulators under sparse reward signals presents a fundamental challenge in deep reinforcement learning (DRL) for robotics: the vast configuration space and absence of dense gradient information combine to produce severe cold-start failures and high cross-seed training variance. This paper proposes Curriculum-SAC-HER, a novel fusion framework integrating Soft Actor–Critic (SAC), Hindsight Experience Replay (HER), and a performance-driven three-stage Automatic Curriculum Learning (ACL) scheduler, designed to resolve the cold-start exploration bottleneck within a training budget of 300,000 environment interaction steps. The core methodology progressively expands the spatial target distribution across three stages of increasing difficulty, conditioning each stage transition on an 80% rolling success threshold to guarantee kinematic prior consolidation before advancing. A rigorous evaluation across 15 independent training runs (five seeds per group, all retained without filtering) demonstrates that the proposed framework achieves a final mean success rate of 84.8% (std: 11.0%), substantially surpassing the SAC + HER ablation (70.3%, Mann–Whitney U test, p = 0.028) and the DDPG baseline (22.3%, p = 0.008), while compressing cross-seed variance by 67% relative to the ablation. Zero-shot robustness evaluations under simulated domain perturbations further reveal that the learned policy maintains above 92% success across extreme friction variations and sustains 71.8% success under a 1.5× payload increase, demonstrating that the ACL module fosters generalized kinematic representations rather than over-fitting to specific contact mechanics. Full article
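A performance-driven stage transition of the kind described above can be sketched as follows. The three stages and the 80% rolling success threshold come from the abstract; the class name, the window length of 100 episodes, and the choice to reset the window after each promotion are assumptions.

```python
from collections import deque

class CurriculumScheduler:
    """Hypothetical scheduler: advance a stage when rolling success >= threshold."""

    def __init__(self, num_stages=3, threshold=0.8, window=100):
        self.stage = 0                        # index of the current difficulty stage
        self.num_stages = num_stages
        self.threshold = threshold
        self.results = deque(maxlen=window)   # rolling record of episode outcomes

    def record(self, success):
        """Log one episode outcome and return the (possibly updated) stage."""
        self.results.append(1.0 if success else 0.0)
        window_full = len(self.results) == self.results.maxlen
        if window_full and self.stage < self.num_stages - 1:
            if sum(self.results) / len(self.results) >= self.threshold:
                self.stage += 1
                self.results.clear()          # start fresh on the harder stage
        return self.stage
```

In a training loop, the returned stage index would select the target-sampling region for the next episode.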
22 pages, 2177 KB  
Article
Research on Comprehensive Unit Price Estimation for Temporary Repair of Ship Equipment Based on the PPO Algorithm
by Zhiyin Wang and Li Xie
J. Mar. Sci. Eng. 2026, 14(13), 1164; https://doi.org/10.3390/jmse14131164 (registering DOI) - 24 Jun 2026
Abstract
After the completion of temporary repair of naval ship equipment, cost settlement has long relied on an ex post auditing model, which results in long cycles and a lack of immediate pricing references for the military. To address this issue, a comprehensive unit price estimation method based on Proximal Policy Optimization (PPO) is proposed, which rapidly generates reasonable unit prices for each process after the repair is completed, thereby providing a quantitative benchmark for negotiation. The unit price estimation problem is formulated as a Markov decision process, and a multi-objective reward function combining range reward, compliance penalty, and final accuracy reward is designed. To alleviate the sparse reward problem, potential-based reward shaping using the Critic network is introduced, which decomposes the final accuracy signal into each pricing step. The clipping mechanism of PPO is adopted to limit the policy update amplitude, thereby improving training stability. Experimental results on 12,000 desensitized real repair records show that the proposed method achieves a mean absolute percentage error (MAPE) of 11.3%, a coefficient of determination (R²) of 0.913, and an abnormal estimation rate (AER) of 3.5%. Compared with standard PPO, the AER is reduced by 59%. The proposed method can sequentially output reasonable unit prices after repair completion, exploring a technical pathway for transforming temporary repair funding from ex post auditing to immediate verification. Full article
(This article belongs to the Special Issue Machine Learning Methodologies and Ocean Science, Second Edition)
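Potential-based reward shaping, as mentioned in the abstract above, adds the term γΦ(s') − Φ(s) to each step reward. The sketch below uses that standard form; treating the critic's value estimate as the potential Φ follows the abstract, while the function name, the default discount, and the zero potential at terminal states are assumptions.

```python
def shaped_step_reward(reward, phi_s, phi_next, gamma=0.99, done=False):
    """Return r + gamma * Phi(s') - Phi(s).

    phi_s, phi_next: potentials of the current and next state, for example
    critic value estimates (hypothetical inputs in this sketch).
    The potential of a terminal state is taken to be 0.
    """
    next_potential = 0.0 if done else phi_next
    return reward + gamma * next_potential - phi_s
```

With `gamma=1`, the shaping terms telescope over an episode to −Φ(s₀), so the sparse final reward is redistributed across steps without changing how complete trajectories rank against each other.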
13 pages, 269 KB  
Article
On the Role of Pure and Rational Religion in Adam Smith’s The Wealth of Nations
by Pilar Bravo de Lallana
Religions 2026, 17(7), 757; https://doi.org/10.3390/rel17070757 (registering DOI) - 24 Jun 2026
Abstract
Adam Smith’s system of natural liberty aimed at the happiness and virtue of humankind. Yet the Scottish philosopher also recognised that the system’s internal dynamics could render it unsustainable unless the state intervened to preserve sociability and justice through the education of both the working and the middle and upper classes. This article argues that the educational programme envisaged in The Wealth of Nations entailed the triumph of a pure and rational religion, understood as the conviction that the Supreme Being valued and rewarded virtue alone, thereby reinforcing the sense of duty, together with an awareness of belonging to an impartially conceived, divinely ordered system, fostering humility. Full article
23 pages, 109510 KB  
Article
Efficiency-Aware Group Size Optimization for GRPO via Multi-Fidelity Bayesian Optimization
by Taehyeon Kim and Kyung-Taek Lee
AI 2026, 7(7), 234; https://doi.org/10.3390/ai7070234 (registering DOI) - 23 Jun 2026
Abstract
Group Relative Policy Optimization (GRPO) streamlines the alignment of Large Language Models (LLMs) and Vision–Language Models (VLMs) by eliminating the Critic model. However, its efficiency heavily depends on the group size, G. While a larger G improves reward estimation and stabilizes the advantage, A_i, it drastically increases VRAM usage and reduces throughput. Standard heuristics like a fixed G of 64 create significant bottlenecks in resource-constrained settings. This paper introduces an Efficiency-Aware optimization framework utilizing Multi-fidelity Bayesian Optimization and Hyperband (BOHB) to dynamically identify the optimal group size, G*. The method uses a multi-objective function that balances reward accuracy, A_i variance, and hardware utilization, applying z-score normalization. By employing Successive Halving to quickly evaluate candidates at low fidelity, the framework reduces search costs by up to 74% compared with random search. Tested across text-only LLMs (Qwen2.5-7B/1.5B) and multimodal VLMs (Qwen2.5-VL-3B), the framework demonstrates that the discovered G* saves up to 72.5% in VRAM compared with the baseline of 64, while maintaining reward accuracy within 5.8%. Sensitivity analyses on hyperparameters like λ, α, and β confirm the framework’s robustness. Rather than treating group size as a mere engineering heuristic, this study establishes a principled methodological advance by formalizing the trade-off between statistical estimation stability and hardware constraints into a unified optimization framework for resource-efficient RLHF. Full article
(This article belongs to the Section AI Systems: Theory and Applications)
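The dependence on group size is easier to see from how a group-relative advantage is usually computed in GRPO: each of the G sampled responses to one prompt is scored, and its reward is standardized against the group. The sketch below uses that commonly cited form; the function name and the `eps` stabilizer are assumptions, and the abstract does not state the exact variant the paper uses.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of one group of G sampled responses.

    rewards: length-G sequence of scalar rewards for the same prompt.
    eps: small constant (assumed) that avoids division by zero when all
         rewards in the group are equal.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Because the mean and standard deviation are estimated from only G samples, a small G makes these estimates noisy, while a large G requires generating and holding more responses in memory, which is the trade-off the abstract describes.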
23 pages, 586 KB  
Article
ESG Disclosure and Firm Value in Saudi Arabia: Evidence from Tadawul Listed Companies Using Dynamic GMM
by Fateh Belouadah, Hassan Ali Alqahtani, Howaida Mohamed Fadol Mohamed, Shadia Daoud Gamer, Nacera Taher Benchohra Belghaouti and Zaki Ahmad
Sustainability 2026, 18(13), 6403; https://doi.org/10.3390/su18136403 (registering DOI) - 23 Jun 2026
Abstract
This study examines the impact of ESG disclosure, leverage, and profitability on firm value, measured by Tobin’s Q, among 67 non-financial Tadawul-listed companies in Saudi Arabia over the period 2015–2024. ESG disclosure is captured through a manual content-analysis index that scores the proportion of expected environmental, social, and governance items reported by each firm. The study further investigates whether board independence moderates these relationships while controlling for liquidity, firm size, current ratio, capital expenditure, and board size. Methodologically, the study employs the two-step system generalized method of moments (system GMM) estimator, which addresses dynamic persistence, endogeneity, and unobserved heterogeneity. The findings reveal that ESG disclosure has a positive and significant effect on firm value, indicating that the Saudi market increasingly rewards firms that provide broader sustainability-related information. Profitability also exerts a positive influence on Tobin’s Q, while leverage has a negative and significant effect, suggesting that higher debt weakens market valuation. Among the moderating effects, board independence significantly reduces the negative impact of leverage on firm value, although it does not significantly strengthen the positive ESG disclosure–firm value relationship. The results also show that liquidity, firm size, capital expenditure, and board size positively influence firm value. The study’s novelty lies in being the first, to our knowledge, to integrate ESG disclosure, financial structure, profitability, and board independence within a single dynamic firm-value framework over a decade-long panel that brackets the Saudi Exchange’s 2021 ESG disclosure guideline. 
In doing so, it advances emerging-market ESG research by showing that, under Saudi Arabia’s largely voluntary disclosure regime and concentrated-ownership structure, board independence operates primarily as a risk-monitoring mechanism rather than as an amplifier of disclosure value. The findings imply that regulators should strengthen and progressively mandate ESG reporting frameworks, that investors should treat ESG transparency as value-relevant information, and that firms should view ESG transparency and prudent governance as strategic tools for enhancing market value in line with Vision 2030. Full article
(This article belongs to the Section Sustainable Management)
27 pages, 1655 KB  
Article
Multi-Model Ensemble Evaluation of Student Design Projects in Higher Education: A Comparative Analysis of AI and Human Expert Grading
by Filip Cvitić, Tajana Koren Ivančević and Nikolina Stanić Loknar
Technologies 2026, 14(7), 382; https://doi.org/10.3390/technologies14070382 (registering DOI) - 23 Jun 2026
Abstract
This study investigates the potential, limitations, and pedagogical implications of applying a parallel multi-model AI evaluation workflow, using ChatGPT, DeepSeek, and Uizard, to assess student design projects in higher education. Because design assessment involves both formal criteria and subjective creative interpretation, the study first established a human expert baseline based on three independent university professors. The human inter-rater reliability was low to moderate, with a mean pairwise Spearman’s ρ of 0.36 and Cronbach’s α of 0.60 for packaging design, and ρ of 0.43 and α of 0.69 for web design. This finding is central to the study, as it shows that the human benchmark in creative design assessment is itself variable and interpretive. Against this baseline, AI–human alignment remained limited and task-dependent. For packaging design, the AI ensemble showed only a weak positive association with the human expert baseline (Spearman’s ρ = 0.30, p = 0.031), which should be interpreted cautiously given the Bonferroni-adjusted significance threshold used in the study. For web design, no significant AI–human association was observed. Qualitative analysis of AI-generated rationales identified recurring limitations, including hallucination, aesthetic shield effects, and missed context, where visually polished work was rewarded despite deeper conceptual or structural weaknesses. The findings suggest that current AI systems can provide useful formative feedback on visible formal features, but they are not reliable as autonomous grading tools for complex creative work. AI-assisted assessment is therefore best understood as a supervised formative support mechanism, while final evaluation should remain grounded in human pedagogical judgment. Full article
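For readers unfamiliar with the reliability statistic quoted above, Cronbach's α for k raters can be computed from a projects-by-raters score matrix with the standard formula α = k/(k−1) · (1 − Σ item variances / variance of summed scores). The sketch below is illustrative; the function name and input layout are assumptions, and it is not the study's analysis code.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (n_projects, n_raters) score matrix."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]                                  # number of raters
    item_var_sum = x.var(axis=0, ddof=1).sum()      # sum of per-rater variances
    total_var = x.sum(axis=1).var(ddof=1)           # variance of summed scores
    return k / (k - 1) * (1.0 - item_var_sum / total_var)
```

When all raters give identical scores the function returns 1; disagreement between raters lowers the value, which is how figures such as 0.60 and 0.69 arise.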
27 pages, 932 KB  
Article
Beyond the Carrot and the Stick: Communication, Autonomy, and Volunteer Motivation in Nonprofit Organizations
by Iulia-Georgiana Hermeneanu, Dana Adriana Lupsa-Tătaru and Ioana-Simona Ivasciuc
Adm. Sci. 2026, 16(7), 301; https://doi.org/10.3390/admsci16070301 (registering DOI) - 23 Jun 2026
Abstract
Conventional approaches to motivating individuals within firms emphasize external incentives, sometimes referred to as the “carrot and stick” paradigm. However, such elements are often absent in volunteer environments, where incentive is derived from psychological and relational influences. In the work context, volunteers are an exceptional case as they lack traditional extrinsic incentives, making them suitable for researching motivation outside this paradigm. This study, based on Self-Determination Theory, explores the impact of communication methods on motivation, satisfaction, and retention of volunteers. The study employs a qualitative design to analyze data from 91 volunteers and 6 coordinators in nonprofit organizations, using content analysis conducted with ATLAS.ti version 26. The findings demonstrate that communication functions as a crucial motivator by promoting autonomy, competence, and relatedness. Volunteers are intrinsically driven by their engagement, the opportunity to make a significant contribution, and experiential learning. Conversely, coordinators influence these experiences by providing feedback, advice, and chances for engagement. The findings indicate a struggle between autonomy and control, illustrating variations in motivation within organizational contexts. The study contributes to existing knowledge by demonstrating that communication serves as a primary motivator and engagement catalyst in the absence of external rewards. This holds significant ramifications for nonprofit administration and motivational philosophy. Full article
20 pages, 888 KB  
Article
Preserved Aesthetic Judgements in Parkinson’s Disease: A Case–Control Study Suggests Limited Need for Content Adaptation for Receptive Arts Engagement
by Blanca T. M. Spee, Domicele Jonauskaite, Bastiaan R. Bloem, Emmy van den Berg, Nina Verhoeven, Dagne Bagdonaviciute, Nicolien Dam, Julia S. Crone, Jorik Nonnekes, David Steyrl and Matthew Pelowski
J. Clin. Med. 2026, 15(13), 4865; https://doi.org/10.3390/jcm15134865 (registering DOI) - 23 Jun 2026
Abstract
Background/Objectives: Parkinson’s disease (PD) is increasingly recognized as a multisystem disorder affecting perceptual, emotional, and reward-related processes. While arts-based interventions in PD have primarily focused on active creative arts engagement, it remains unclear whether receptive arts engagement with visual art—how artworks are perceived and evaluated—is altered. Our objective is to determine whether aesthetic evaluation of visual artworks differs in individuals with PD compared to age-matched healthy controls. We further examine whether emotional interpretation, color-emotion associations, and experiential responses to art viewing are altered. Methods: In a cross-sectional case–control study, individuals with PD (n = 87) and age-matched healthy controls (n = 49) completed two online assessments. Participants evaluated 36 artworks from the Vienna Art Picture System in terms of liking, beauty, and subjective art attributes. Objective image-derived features were computed for each artwork. Interpretable machine learning models were used to test whether evaluation patterns predicted diagnostic group and to identify determinants of aesthetic judgments. Participants further completed a color-emotion association task using ambiguous expressive portraits and reported perceived changes in cognitive, emotional, motivational, and physical states following art viewing. Results: Aesthetic evaluation patterns did not support reliable classification of PD status, indicating no systematic group differences in liking, beauty, or attribute-based judgments between PD and controls. Instead, aesthetic judgments were robustly predicted by individual differences and objective artwork properties, including art-historical style, symmetry, complexity, and color-related features, whereas diagnostic group, gender, and age did not contribute to predictions. 
Emotional interpretation and color-emotion associations were largely comparable between groups, with a single specific deviation in color-emotion mapping. Positive emotions were less frequently associated with pink in people with PD. Self-reported experiential responses to art viewing did not differ significantly between groups. Conclusions: Aesthetic evaluation of visual artworks appears largely preserved in people with PD. These findings suggest that, in digital viewing contexts, substantial adaptation of visual content to make it accessible for people with PD may not be necessary, although subtle perceptual and emotional differences may still be relevant. Efforts may instead be better directed toward addressing practical barriers to visual art engagement. Full article
(This article belongs to the Special Issue Parkinson's Disease: Recent Advances in Diagnosis and Treatment)
24 pages, 747 KB  
Article
Cluster-Based Q-Learning Relational Game (C-QLRG): A Practical Relaxation for Asymmetric Online Social Networks
by Duc Nghia Vu and Janos Demetrovics
AI 2026, 7(6), 231; https://doi.org/10.3390/ai7060231 (registering DOI) - 22 Jun 2026
Abstract
The Q-Learning Relational Game (QLRG) framework provides a theoretically rigorous method for identifying minimal winning coalitions in online social networks (OSNs) under the restrictive assumption of global agent symmetry or uniform matroid structure. Real-world OSNs, however, exhibit significant asymmetry. This paper introduces the Cluster-Based Q-Learning Relational Game (C-QLRG), a practical extension that relaxes the global symmetry requirement by leveraging community structure. We partition the agent set into communities with bounded internal variation and represent the state solely by community membership counts of the seed set. Because the closure operator already captures all eventual influence spread, the problem reduces to a sequential seed selection task where the agent decides, at each step, from which community to add the next seed. We prove that the optimal Q-function of a suitably regularized reach-efficiency objective is Lipschitz continuous and derive a performance bound for the learned policy. The full algorithm is presented, and its complexity is analyzed. Empirical evaluations on a synthetic asymmetric network and Zachary’s Karate Club demonstrate that C-QLRG is highly sensitive to reward parameters, where default settings lead to premature stopping, but parameter tuning combined with a corrected minimality verification recovers high-efficiency coalitions by removing non-contributing agents. With tuned parameters, C-QLRG produces a near-winning coalition of size 11 and 99% reach on the synthetic network, surpassing the greedy baseline’s efficiency (size 12) despite a one-node coverage gap, while identifying the optimal winning coalition of size 1 on the Karate Club dataset, matching all baselines. The framework thus offers a principled trade-off between model fidelity and scalability, with the reward design choice being critical for practical deployment. Full article
23 pages, 1183 KB  
Article
Modeling AI-Assisted Plagiarism in Academic Social Environments Using Qualitative Plausibility Assessment Supports of the Simulation by Large Language Models
by Ihsan Ibrahim, Anak Agung Putri Ratna, Prima Dewi Purnamasari and Naoki Fukuta
Systems 2026, 14(6), 721; https://doi.org/10.3390/systems14060721 (registering DOI) - 22 Jun 2026
Abstract
This study investigates how AI-assisted plagiarism changes dishonest academic behavior in a socially interactive learning environment under different educational conditions. To this end, it develops a scenario-based simulation in which students are represented as autonomous agents embedded in local peer networks who adapt their weekly behavior under academic pressure, institutional intervention, and available cheating options. Two behavioral scenarios are considered: a conventional plagiarism environment, in which agents choose between honest submission and direct copying, and an AI-augmented environment, in which AI-assisted plagiarism is introduced as an additional dishonest strategy. Intervention is modeled through environmental and institutional conditions, specifically detection probability and sanction severity, rather than through direct internal reward manipulation. Q-learning is used as a simplified adaptive mechanism for repeated agent choice. Simulation results show that the availability of AI-assisted plagiarism substantially changes the behavioral composition of misconduct by increasing total dishonest behavior and shifting a large share of it toward the AI-assisted category. In the simulation, active intervention reduces dishonest behavior overall but does not eliminate AI-assisted plagiarism as the dominant dishonest strategy in the AI-augmented environment. These observations suggest that academic misconduct in the AI era should be understood not only as a problem of deterrence but also as a problem of behavioral adaptation under changing technological and institutional conditions. 
To support the realism assessment of the simulation design, the study also conducts a structured qualitative plausibility review using multiple large language models under a shared prompt. Across these reviews, the model is judged to be acceptable as a first-stage stylized baseline, while important limitations are identified in agent heterogeneity, social influence depth, and the use of Q-learning as a simplified adaptive heuristic for reproducing the actors' behavior. Full article
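A toy version of the mechanism in this abstract, with intervention acting through the environment rather than through the agent's internal reward, might look like the sketch below. The tabular Q-learning update is the standard one; the payoff function, its parameter names, and all numeric defaults are hypothetical and are not taken from the paper.

```python
import random

def cheating_payoff(rng, gain, p_detect, sanction):
    """Hypothetical payoff of a dishonest action: the gain, minus a sanction
    that applies only if the act is detected (with probability p_detect)."""
    detected = rng.random() < p_detect
    return gain - sanction if detected else gain

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update on a nested dict q[state][action]."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]

# Example: one weekly decision in a single-state toy setting.
rng = random.Random(0)
q = {"week": {"honest": 0.0, "copy": 0.0, "ai_assisted": 0.0}}
r = cheating_payoff(rng, gain=1.0, p_detect=0.3, sanction=2.0)
q_update(q, "week", "ai_assisted", r, "week")
```

Raising `p_detect` or `sanction` lowers the expected payoff of the dishonest actions, which is how institutional conditions could shift learned behavior without editing the agent's reward function directly.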
37 pages, 1597 KB  
Article
Topology-Aware Graph Reinforcement Learning for Voltage-Reactive Power Control in Grid-Connected Microgrids
by Yunfei Zhang, Kefan Bao, Gaige Liang, Wennan Zhuang, Longlong Qiang, Difei Tang, Xiangyu Lu and Mingxiao Zhang
Electricity 2026, 7(2), 60; https://doi.org/10.3390/electricity7020060 (registering DOI) - 22 Jun 2026
Abstract
As the global energy transition accelerates, distribution systems are integrating increasing shares of inverter-interfaced renewables, making reliable voltage support a key operational requirement. In grid-connected microgrids, especially weak radial feeders in rural and remote areas, voltage-reactive power (Volt/Var) control must coordinate multiple inverters under uncertainty from photovoltaic (PV) intermittency, load volatility, and point-of-common-coupling (PCC) disturbances. Existing droop, model-based optimization, and non-graph reinforcement learning (RL) approaches often rely on fixed rules or do not explicitly exploit electrical topology, which limits adaptive coordination. To address this gap, we propose a topology-aware graph reinforcement learning framework for voltage-reactive power control in grid-connected microgrids under uncertainty. The method encodes node states with a graph convolutional network (GCN) and learns coordinated PV/storage reactive-power actions via proximal policy optimization (PPO) with a multi-objective reward balancing voltage quality, control effort, and action smoothness. In a controlled comparison against a multilayer perceptron (MLP)-PPO baseline with identical action space, reward, and PPO objective, our method reduces voltage violation rate (VVR) from 0.0316 ± 0.0086 to 0.0048 ± 0.0019. Additional validation on a modified IEEE 33-bus feeder further reduces VVR from 0.00726 for MLP-PPO and 0.02999 for Droop control to 0.00095, supporting the effectiveness of topology-aware state representation on a larger radial benchmark feeder. Full article
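To show what encoding node states with a GCN means on a feeder, the sketch below applies one graph-convolution layer in the widely used symmetric-normalized form, ReLU(D^(-1/2) (A + I) D^(-1/2) H W). The toy three-bus radial feeder, the feature layout, and the function name are assumptions; the abstract does not describe the paper's network architecture in this detail.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution layer with self-loops and symmetric normalization.

    adj:      (n, n) bus adjacency matrix of the feeder (1 where a line exists)
    features: (n, f_in) per-bus features, e.g. voltage magnitude and net load
    weight:   (f_in, f_out) weight matrix (learned in practice, fixed here)
    """
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))      # D^(-1/2) as a vector
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0) # ReLU activation

# Toy radial feeder: bus 0 - bus 1 - bus 2
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
embeddings = gcn_layer(adj, np.ones((3, 2)), np.ones((2, 4)))
```

Each bus embedding mixes the bus's own features with those of its electrical neighbors, which is the topology information a plain MLP over a flattened state vector does not receive explicitly.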

23 pages, 7704 KB  
Article
Risk-Sensitive Distributional Proximal Policy Optimization for Safe Highway Lane-Change Decision-Making
by Qing Ye, Rongliang Zhou, Jiakun Huang, Yaxuan Liu and Xiaolin Song
Appl. Sci. 2026, 16(12), 6271; https://doi.org/10.3390/app16126271 (registering DOI) - 22 Jun 2026
Abstract
Decision-making is a critical module for intelligent vehicles to achieve safe and efficient autonomous driving. However, most existing reinforcement learning-based decision-making methods optimize policies by maximizing the expected return, which may inadequately account for low-probability but high-cost safety risks in complex traffic interactions. To address this issue, this paper proposes Risk-Sensitive Distributional Proximal Policy Optimization (RSDPPO), a risk-sensitive variant of proximal policy optimization (PPO), for highway lane-changing decision-making. Within the PPO framework, a distributional state-value function is introduced to model the return distribution under the current policy, and a Wang distortion-based risk measure is further incorporated to construct a risk-sensitive advantage function. In this way, risk information contained in the return distribution can be propagated into the policy gradient update, guiding the learned policy to avoid high-risk driving behaviors while maintaining training stability. Simulation experiments are conducted in a highway lane-changing scenario with heterogeneous surrounding vehicles. The results show that, under medium-density traffic, the proposed method outperforms representative baseline algorithms in cumulative reward, success rate, and safety reward. Further evaluation under higher-density traffic demonstrates that RSDPPO maintains better overall performance, indicating stronger adaptability to denser traffic conditions. Ablation studies further show that risk-averse distortion improves the balance between safety and efficiency by increasing safety margins during car-following and lane-changing maneuvers. These results indicate that RSDPPO provides an effective risk-sensitive policy optimization framework for safety-oriented highway lane-changing decision-making.
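To make the Wang distortion concrete, the sketch below applies the standard Wang transform g(u) = Φ(Φ⁻¹(u) + λ) to a return distribution represented by equally weighted quantile values. It illustrates only the risk measure; how the authors turn it into a risk-sensitive advantage is not stated in the abstract. The quantile representation, the function names, and the sign convention (λ > 0 is risk-averse here) are assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def wang_distortion(u, lam):
    """Wang transform g(u) = Phi(Phi^{-1}(u) + lam) on a probability u."""
    # inv_cdf is undefined at 0 and 1, so handle the endpoints directly.
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(u) + lam)


def distorted_value(quantiles, lam):
    """Distorted expectation of a return distribution.

    quantiles -- equally weighted samples/quantile values of the return
    lam       -- distortion parameter; under this convention lam > 0
                 shifts probability mass toward low returns (risk-averse)
                 and lam = 0 recovers the ordinary mean.
    """
    values = sorted(quantiles)  # ascending: worst returns first
    n = len(values)
    total = 0.0
    for i, value in enumerate(values):
        # Weight = increase of the distorted CDF across this quantile bin.
        weight = (wang_distortion((i + 1) / n, lam)
                  - wang_distortion(i / n, lam))
        total += weight * value
    return total
```

Because g is concave for λ > 0, the lowest quantiles receive the largest weights, so a rare but very negative return pulls the distorted value below the plain mean.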
(This article belongs to the Section Computing and Artificial Intelligence)

17 pages, 13011 KB  
Article
An Anti-Swept-Frequency-Jamming Communication Method Based on Proximal Policy Optimization for Nonlinear Scenarios
by Xinrui Xu, Ke Yin, Yingtao Niu and Huacheng Zhu
Electronics 2026, 15(12), 2737; https://doi.org/10.3390/electronics15122737 (registering DOI) - 22 Jun 2026
Abstract
With advances in electronic attack technologies, intelligent jamming poses a significant challenge to the reliable transmission of wireless communications. Traditional anti-jamming methods often fail to adapt to dynamic nonlinear jamming environments. This paper addresses nonlinear swept-frequency jamming by modeling anti-jamming communication as a sequential decision-making problem and proposes an intelligent anti-jamming method based on proximal policy optimization (PPO) to optimize dynamic channel selection. First, the channel selection problem is formalized as a Markov decision process (MDP), where a state space integrating jamming patterns and communication status is designed, the channel set is defined as the action space, and a multi-objective reward function trades off jamming avoidance against switching overhead. A dual-network architecture comprising a policy network and a value network is then constructed, and the PPO algorithm is employed for policy updates, with a clipping mechanism used to enhance training stability. The system optimizes the anti-jamming strategy online through a closed-loop process of “sensing–decision–learning–communication”. Simulation results demonstrate that, compared to conventional methods, the proposed method significantly improves key performance indicators such as packet success rate and throughput. It can rapidly track changes in jamming, exhibiting excellent real-time performance and environmental robustness, and thus provides an effective solution for reliable communication in dynamic jamming environments.
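The clipping mechanism mentioned above is the standard PPO clipped surrogate objective. The sketch below evaluates it for a single action; it shows the generic PPO rule, not this paper's specific implementation, and the function name and the default `eps=0.2` are assumptions.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    logp_new  -- log-probability of the action under the updated policy
    logp_old  -- log-probability under the policy that collected the data
    advantage -- advantage estimate for the action
    eps       -- clip range (0.2 is a common default, assumed here)
    """
    # Probability ratio between the new and old policies.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to move the ratio
    # further outside the clip range in the advantage's direction.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, with a positive advantage the objective stops growing once the ratio exceeds `1 + eps`, which bounds how far a single update can push the channel-selection policy toward an action that looked good in the last batch.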
(This article belongs to the Section Microwave and Wireless Communications)