Review Reports
- Bo Hou1,2,
- Guangyu Pan1 and
- Yao Chen3,*
Reviewer 1: Anonymous Reviewer 2: Mengjie Zhou
Round 1
Reviewer 1 Report
Comments and Suggestions for AuthorsThe manuscript presents Value-Based Reward Shaping (VBRS) for adversarial reinforcement learning, combining a handcrafted dense reward with a critic-based value signal to improve reward alignment and win rate in a 2D combat environment.
- The discussion of “alignment” would benefit from a clearer theoretical grounding: Section 3.2 implicitly assumes (V(s;\omega)\approx V^{\pi}(s)) without quantifying the approximation error, and the resulting shaping effectively reweights long-horizon contributions in a way that could change the optimal policy.
- There is an inconsistency between the claim of using PPO (Table 1) and the presentation of Algorithm 1 as generic gradient ascent; the authors should explicitly state the PPO objective, clarify where the auxiliary critic is used (e.g., GAE, clipping, entropy terms), and explain how value normalization or detaching is handled.
- The adaptive ratio rule (Eq. 19) relies on a noisy estimate (\hat{J}); this quantity should be precisely defined (how it is computed over batches, whether it is smoothed, and how its sign is treated), and the method would be strengthened by stability safeguards and an ablation comparing fixed (\beta) to the learned schedule.
- The experimental section would benefit from reporting confidence intervals and statistical tests on final win rates; while 10 seeds are acceptable, clearer inferential reporting and evaluation details (episodes per evaluation, seed control, opponent stochasticity) are needed.
- Evaluating only against PPO without VBRS makes it difficult to isolate the contribution of the proposed shaping; stronger baselines are recommended, including potential-based reward shaping (handcrafted and learned) and at least one intrinsic-shaping or adversarial-RL comparator.
- The references are generally current, but the positioning of VBRS could be improved by citing a small set of canonical works or surveys on potential-based reward shaping and adversarial or robust self-play reinforcement learning.
Author Response
Thanks for your valuable comments. We have revised the manuscript accordingly, and the detailed responses are provided in the attached PDF.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors1. I have concerns about adding V(st; ω) directly to the reward. The value function V(s) represents expected cumulative future reward, which already implicitly includes future rewards. Adding this to the immediate reward creates a form of double-counting that isn't clearly justified. The normalization mentioned in the text (lines 192-195) is vague - what normalization scheme exactly?
2. The mechanism for updating β based on improvement ratios seems fragile. What happens with noisy objectives? Small denominator values could cause instability despite the δ term. Additionally, starting β at 0 and only increasing it when training "stabilizes" seems to defeat the purpose - you might want the value guidance most during early exploration when the handcrafted reward is most misleading.
3. Figure 3 shows learning curves with shaded regions from 10 seeds, but what do the shaded regions represent (std dev, std error, min-max)? Besides, there is no statistical tests compare final performance.
4. Some more recent work on learned reward shaping and reward learning could be included.
Author Response
Thanks for your valuable comments. We have revised the manuscript accordingly, and the detailed responses are provided in the attached PDF.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for AuthorsThe revised manuscript examines Value-Based Reward Shaping (VBRS) for adversarial reinforcement learning, where a learned state-value signal is incorporated into a dense reward, and evaluates PPO with VBRS in a 2D dogfight-style environment against fixed opponents. In this second round, the presentation is noticeably clearer—particularly regarding the PPO formulation, the shaping schedule, and the statistical reporting—and the empirical evidence for improved win rates is stronger.
- The authors correctly clarify that VBRS does not fall under policy-invariant shaping and instead alters state-occupancy weighting. That said, the discussion of “alignment” still feels more heuristic than formal; it would be preferable to phrase this as an empirical reduction of misalignment, without suggesting theoretical guarantees.
- The PPO implementation is now well explained (including the ratio, clipping, entropy term, and use of shaped advantages). For full reproducibility, Table 1 and Algorithm 1 should also specify whether the policy log-standard deviation is learned, fixed, or bounded, and clearly describe how action squashing or clipping is applied.
Author Response
Thank you for insightful comments and suggestions. We have revised the manuscript accordingly and addressed all points in detail below.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authorsaccept
Author Response
Thank you for the careful assessment and helpful comments on our submission, as well as the time and effort devoted to the review. Your valuable insights have greatly helped us improve the quality of the paper.
Best wishes for you!