1. Introduction
Reinforcement learning (RL) is a branch of machine learning in which an agent learns an optimal behavior or policy through trial-and-error interactions with its environment [1]. The agent receives scalar rewards as feedback and aims to maximize the cumulative reward over time. RL is particularly powerful for sequential decision-making tasks, such as robotic control [2], game playing [3], robotic manipulation [4], and humanoid robots [5].
Deep Reinforcement Learning (DRL) integrates deep neural networks with RL, enabling agents to handle high-dimensional sensory inputs (e.g., images, sensor data) and continuous action spaces [6]. DRL algorithms, such as Deep Deterministic Policy Gradient (DDPG) [7] and Proximal Policy Optimization (PPO) [8], have achieved remarkable success in various domains. They allow systems to adapt dynamically to changing conditions, optimize long-term performance, and learn end-to-end control policies without precise mathematical models.
Despite their power, DRL methods still face challenges in sample efficiency and training stability, particularly when rewards are delayed or sparse [9]. In reinforcement learning, the reward function is a critical component of the environment that defines the agent's objective by providing scalar feedback signals in response to its actions. It essentially tells the agent what goals to achieve by assigning positive or negative rewards to specific states or behaviors during interactions. However, a poorly designed or inherently sparse reward function can lead to significant learning difficulties.
When rewards are only provided upon task completion, the agent receives little to no guidance during the majority of its interactions, making exploration inefficient and learning extremely slow. In robotic domains such as humanoid robotics, reinforcement learning can train controllers or agents to find optimal solutions for complex tasks by enabling the robot to interact repeatedly with the environment [10]. The reward function is the key element that guides a humanoid robot toward the desired solution. However, sparse rewards in tasks such as motion tracking make the training procedure inefficient.
Reinforcement learning faces an inherent trade-off between the ease of designing reward functions and the ease of learning from them, and reward shaping provides a solution [11]. Reward shaping entails modifying the reward function of a problem to improve the training speed and efficiency of reinforcement learning algorithms [12].
The primary goal of reward shaping is to modify the agent's learning environment by introducing supplementary reward signals [13]. This allows the agent to receive more frequent and informative feedback during its interactions, facilitating faster and more effective strategy adjustments to achieve the intended learning objectives.
Reward shaping is broadly categorized into two main types: curiosity-driven exploration, which motivates agents to explore novel states by providing intrinsic rewards for encountering unfamiliar or unpredictable situations [14]; and potential-based reward shaping (PBRS), which guides the agent by adding a reward signal derived from the difference in a potential function between consecutive states, thereby shaping the learning process without altering the optimal policy [15].
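For concreteness, the shaping term in standard PBRS is the discounted difference of a potential function evaluated at consecutive states:

$$F(s, s') = \gamma\,\Phi(s') - \Phi(s), \qquad r'(s, a, s') = r(s, a, s') + F(s, s'),$$

where $\Phi(\cdot)$ is the potential function and $\gamma$ is the discount factor. Because these shaping terms telescope along any trajectory, adding $F$ changes the magnitude of returns but not the relative ordering of policies, which is why the optimal policy of the original MDP is preserved.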
Reward shaping in reinforcement learning is a widely studied and active research area. To provide denser feedback, Ref. [16] introduces a shaped reward function based on potential functions to address the sparse reward problem in DQN for UAV emergency communication. By incorporating a shaping reward derived from the difference in potential between consecutive states, the method provides dense intermediate feedback to guide the agent's exploration. This approach accelerates training convergence and improves overall performance while theoretically preserving the optimal policy of the original MDP. In [17], the authors propose Rank2Reward, a novel reward shaping method that learns well-shaped reward functions directly from passive video demonstrations. The key innovation is to leverage the logit output of a video discriminator as a ranking-based potential function. By incorporating this potential into the reward, the method provides dense, directional feedback that guides policy exploration toward more expert-like states, enabling effective reinforcement learning without requiring expensive state-action annotations. For robotic manipulation, Ref. [18] presents a framework for robotic hand manipulation that combines reward shaping with domain randomization to achieve efficient sim-to-real transfer. The key contribution lies in designing a dense, multi-component shaped reward function to accelerate RL training in simulation. Meanwhile, Ref. [19] proposes a safety-oriented reward shaping framework inspired by barrier functions (BFs). The key contribution is a shaping reward term derived from the barrier function condition, which provides dense feedback to guide the agent toward safe states without requiring value function estimation. This method enhances training efficiency and reduces actuation effort while inherently promoting safety, as demonstrated in both simulation and real-world deployment on a quadruped robot.
While reward shaping (RS) is a powerful technique for guiding agent learning by providing additional rewards, designing effective shaping functions remains a significant challenge. Constructing appropriate shaping rewards that reliably accelerate learning without altering the optimal policy requires deep domain knowledge and careful tuning, a challenge analogous to the difficulty of designing control Lyapunov functions (CLFs) in control theory [20], where both aim to ensure system convergence through carefully crafted functions. In [21], the authors propose a Lyapunov-based reward shaping strategy that incorporates physics-informed stability guarantees into reinforcement learning by using the negative change in a Lyapunov function as a shaping reward, ensuring stable and optimal control of robot manipulators. However, the construction of such a Lyapunov function lacks a systematic approach. Ref. [22] proposes a state-space segmentation-based potential function for potential-based reward shaping, where constant potential values are assigned to segmented regions of the state space, enabling systematic and efficient guidance for reinforcement learning agents. However, this potential function is designed offline and relies on predefined state-space segmentation, making it unable to adapt promptly to dynamic or changing environments. As a result, the agent may struggle to generalize across varying conditions or respond effectively to unforeseen disturbances.
In [23], the authors introduce adaptiveness into shaping rewards by proposing the EGCARL method, which features a Dynamic Context-Aware Reward Weighting (DCARW) mechanism that dynamically adjusts reward weights based on real-time traffic conditions, combined with a Generative Adversarial Imitation Learning (GAIL) mechanism for expert-guided reward generation, enabling adaptive and efficient autonomous driving. However, this approach still relies on expert experience for effective reward generation. There is therefore a need for an online reward shaping method that can dynamically adjust the shaping signal in real time without expert supervision, enabling the agent to interact more effectively and explore more efficiently in diverse and evolving environments.
In summary, there is a clear need for a reward shaping framework that is both adaptive and independent of expert supervision. Ideally, such a reward function should be dynamic rather than static, capable of adjusting itself in real time according to the environment, thereby providing timely and relevant guidance for efficient exploration and policy learning. Moreover, it should not rely on expert demonstrations; instead, the adaptiveness should emerge from an online optimization process that autonomously refines the shaping signal during learning. This would enable robust and generalizable learning across diverse and changing environments, paving the way for more autonomous and scalable reinforcement learning in real-world applications.
Thus, we introduce a novel reward framework, named two-layered reward, as illustrated in Figure 1. The overall reward function consists of two components: (1) the goal reward, which reflects the primary task objective, such as reproducing a desired motion trajectory on a humanoid robot; and (2) the optimizing reward, which incorporates secondary criteria such as stability (e.g., preventing falls), smoothness, energy efficiency, and safety considerations. The goal reward forms the first layer of the framework and is used to evaluate task-level performance. The outcome of this evaluation is then fed into the second layer, where the optimizing reward is adjusted online through a meta-heuristic optimization process. This adaptation is performed without relying on expert demonstrations, enabling fully autonomous tuning of the shaping signal during learning. We implement this online optimization using meta-heuristic algorithms such as the Grey Wolf Optimizer (GWO) [24] and the Optimal Stochastic Process Optimizer (OSPO) [25], allowing the agent to dynamically balance primary objectives with auxiliary constraints in response to environmental changes. This approach not only enhances learning efficiency and policy performance but also ensures robustness and adaptability in complex, dynamic tasks, addressing key limitations of existing reward shaping methods.
The proposed two-layered reward framework differs from existing approaches, such as hierarchical curricula [26], potential-based reward shaping (PBRS) [27], and meta-gradient reward learning [28], in both mechanism and objective. Hierarchical curricula rely on manually designed training stages with a predefined difficulty progression. While effective, they require expert knowledge to specify transition conditions and lack adaptability to varying learning dynamics. In contrast, our method automatically adjusts the influence of auxiliary objectives based on real-time performance, eliminating the need for hand-crafted schedules. PBRS modifies rewards using a potential function to preserve the optimal policy, but this imposes strict mathematical constraints that limit the types of auxiliary objectives that can be incorporated. Our approach treats the lower-layer reward as a general optimization target, enabling flexible integration of domain-specific criteria without theoretical restrictions. Meta-gradient methods learn reward weights via gradient estimation, yet they suffer from high variance and instability, especially when trained on noisy or suboptimal trajectories. In contrast, our method employs a population-based meta-heuristic optimizer that updates reward weights using only high-performing rollouts. In summary, while prior methods require manual scheduling, impose theoretical constraints, or depend on unstable gradient estimates, our framework enables online, stable, and goal-conditioned reward shaping, autonomously balancing task completion with behavioral quality through performance-constrained adaptation.
  3. Two-Layered Reward Reinforcement Learning
  3.1. Motivations for Two-Layered Reward
In reinforcement learning (RL), the design of the reward function plays an important role in shaping the learning efficiency and final performance of the agent. Traditional reward shaping methods typically employ a static, fixed-weight linear combination of multiple sub-rewards, such as Equation (5). However, such an approach has inherent limitations: the weights remain constant throughout the training procedure, so the reward cannot adapt to the evolving needs of different learning stages or respond to dynamically changing environmental conditions.
For instance, in the humanoid robot motion tracking tasks considered in this paper, the priorities of learning objectives gradually shift during training. Specifically, in the early stages, the primary goal is to maintain balance and avoid falling, making energy efficiency or precise trajectory tracking secondary. As training progresses and basic stability is achieved, the focus shifts toward accurately following the reference motion trajectory. Only in the later stages, once stable tracking has been established, should optimization emphasize secondary objectives such as energy conservation, motion smoothness, and joint effort minimization. A static reward function with fixed weights cannot effectively capture these evolving priorities, often leading to suboptimal learning dynamics and slowing down the initial training phase, as the critical but challenging objective of simply avoiding falls may be overshadowed by other goals such as precise trajectory tracking.
Moreover, real-world environments are inherently dynamic and uncertain. If the reward function can adapt online based on the current state and task progress, it can guide the agent toward more meaningful exploration, thereby accelerating convergence and improving policy robustness.
To address these challenges, we propose a two-layered reward structure that decomposes the overall reward into two distinct components:
(1) Upper-layer reward (goal reward): This represents the core task objective and is static and fixed, such as the tracking error between the robot's body pose and the reference motion. It directly reflects task completion and serves as the fundamental driving force for policy learning.
(2) Lower-layer reward (optimizing reward): This consists of auxiliary, manually designed sub-rewards, such as penalties on action magnitude, joint velocity, or orientation deviation, which encourage desirable behaviors beyond basic task execution. These sub-rewards are not essential for task success but act as "polishing" terms that refine the policy quality.
Crucially, the lower-layer reward is not merely a matter of weight adjustment; rather, it constitutes a “priority-driven optimization process.” The upper-layer reward establishes a performance threshold, and the optimization of the lower-layer reward is activated only when this primary goal is sufficiently achieved.
This hierarchical design enables “goal-conditioned” reward shaping: the lower-layer reward is dynamically optimized based on the achievement of the primary task objective. It allows the agent to first master fundamental skills before progressively refining secondary objectives, leading to more structured and interpretable exploration.
  3.2. Two-Layered Reward Reinforcement Learning Framework
The two-layered reward reinforcement learning framework is depicted in Figure 3. This framework is designed to address the limitations of traditional fixed-weight reward shaping in complex control tasks. It decomposes the raw reward signal $r$ obtained from the environment into two distinct components: (1) the upper-layer goal reward $r_{\text{goal}}$; and (2) the lower-layer optimizing reward $r_{\text{opt}}$. In doing so, it establishes an adaptive, goal-conditioned reward shaping mechanism that enables explainable and meaningful exploration.
As illustrated in Figure 3, the agent interacts with the environment, receiving state observations and a composite reward $r$. This reward consists of two parts:
(1) Upper-layer "goal reward" $r_{\text{goal}}$: This component represents the core task objective and corresponds to the task-specific tracking error, such as the deviation between the robot's body pose and the reference motion trajectory. It serves as a fixed, static metric of task completion and forms the primary criterion for evaluating policy performance. In this work,

$$r_{\text{goal}} = \sum_{i=1}^{N} w_i^{\text{track}}\, r_i^{\text{track}},$$

where $r_i^{\text{track}}$ denotes the individual tracking-related reward terms, $w_i^{\text{track}}$ are their weights, and $N$ is the number of tracking objectives.
(2) Lower-layer "optimizing reward" $r_{\text{opt}}$: This component consists of auxiliary, manually designed sub-rewards that encourage desirable behaviors beyond basic task execution, such as motion smoothness, energy efficiency, and dynamic stability. Specifically, in this work,

$$r_{\text{opt}} = \sum_{j} w_j^{\text{p}}\, r_j^{\text{p}} + \sum_{k} w_k^{\text{reg}}\, r_k^{\text{reg}},$$

where $r_j^{\text{p}}$ denotes penalty terms (e.g., on joint torques or posture deviations), $r_k^{\text{reg}}$ represents regularization terms (e.g., on action magnitude or velocity), and $w_j^{\text{p}}$, $w_k^{\text{reg}}$ are tunable weighting coefficients. These components act as "polishing" rewards that refine policy quality once the primary task objective is sufficiently achieved.
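To make the decomposition concrete, the following sketch (with hypothetical term names, values, and weights, not the exact terms used in this work) shows how the composite reward could be assembled for one environment step:

```python
import numpy as np

def goal_reward(track_terms, track_weights):
    """Upper-layer reward: fixed weighted sum of tracking-related terms."""
    return float(np.dot(track_weights, track_terms))

def optimizing_reward(penalty_terms, reg_terms, w_p, w_reg):
    """Lower-layer reward: weighted penalty and regularization terms."""
    return float(np.dot(w_p, penalty_terms) + np.dot(w_reg, reg_terms))

# Hypothetical per-step values for one environment (illustration only).
track_terms   = np.array([0.8, 0.7, 0.6, 0.5])   # e.g., pos / rot / lin-vel / ang-vel tracking
track_weights = np.array([1.0, 0.5, 0.5, 0.5])   # static, fixed in advance
penalty_terms = np.array([-0.02, -0.01])         # e.g., torque and posture penalties
reg_terms     = np.array([-0.005, -0.003])       # e.g., action-rate and velocity regularizers
w_p   = np.array([0.9, 0.9])                     # tuned online (e.g., by GWO), kept bounded
w_reg = np.array([0.5, 0.5])

r_goal  = goal_reward(track_terms, track_weights)
r_opt   = optimizing_reward(penalty_terms, reg_terms, w_p, w_reg)
r_total = r_goal + r_opt                         # composite reward passed to the agent
```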
As illustrated in Figure 4, during training we employ a parallel simulation setup with a large number of environments (e.g., 4096 independent rollouts). First, all episodes are evaluated and ranked based on the upper-layer goal reward $r_{\text{goal}}$. The top 10% of episodes, i.e., those with the highest $r_{\text{goal}}$ values and thus the best task completion performance, are selected as the candidate set. For example, with 4096 environments, the top 409 episodes are selected for further evaluation.
Each of these selected episodes is assigned an initial rank $R_i^{\text{init}} \in \{1, \dots, 409\}$, which reflects its relative performance based solely on $r_{\text{goal}}$. Specifically, $R_i^{\text{init}} = 1$ denotes the episode with the highest $r_{\text{goal}}$, while $R_i^{\text{init}} = 409$ corresponds to the lowest-ranked episode within the top 10%. For instance, the top-ranked entry may correspond to the 992nd environment (among the 4096 total), meaning that this environment achieved the best task performance and thus holds the first rank in the candidate set.
Next, within this filtered set of 409 episodes, the full composite reward $r = r_{\text{goal}} + r_{\text{opt}}$ is computed for each episode, and the episodes are re-ranked based on this total reward. This re-ranking yields a new global rank $R_i^{\text{final}}$, which reflects the episode's position across all environments (not just the top 10%), meaning $R_i^{\text{final}}$ can indeed exceed 409.
To evaluate the effectiveness of the current optimizing reward $r_{\text{opt}}$, we compute a rank deviation penalty that quantifies how much an episode's relative standing degrades when $r_{\text{opt}}$ is included. Specifically, if an episode falls outside the top 10% (i.e., $R_i^{\text{final}} > 409$) despite having been selected based on a high $r_{\text{goal}}$, this suggests that $r_{\text{opt}}$ may be encouraging behaviors that do not align with overall task performance.
We define the following objective function to capture this effect:

$$J = \sum_{i=1}^{M} \left( R_i^{\text{final}} - R_i^{\text{init}} \right) \cdot \mathbb{1}\!\left( R_i^{\text{final}} > M \right), \qquad M = 409,$$

where $R_i^{\text{init}}$ is the initial rank of the $i$-th episode within the top 10% based on $r_{\text{goal}}$, $R_i^{\text{final}}$ is its final global rank based on $r = r_{\text{goal}} + r_{\text{opt}}$, and $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 if $R_i^{\text{final}} > M$ and 0 otherwise.
This penalty term is applied only when an episode drops out of the top 10% after incorporating $r_{\text{opt}}$, indicating a misalignment between the optimizing reward and task success. Episodes that remain within the top 10% (i.e., $R_i^{\text{final}} \le 409$) receive no penalty, thereby encouraging exploration guided by $r_{\text{opt}}$ as long as it does not compromise the primary task objective. In this way, the framework ensures that improvements in auxiliary objectives contribute meaningfully to overall performance, promoting structured and goal-consistent exploration.
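The selection and re-ranking procedure can be sketched in a few lines of Python; because the exact functional form of the penalty is reconstructed from the description above, the following should be read as an illustrative sketch rather than the authors' exact implementation:

```python
import numpy as np

def rank_deviation_penalty(r_goal, r_opt, top_frac=0.10):
    """Rank-deviation objective for one candidate setting of the lower-layer weights.

    r_goal, r_opt: per-episode goal and optimizing rewards (length = number of envs).
    Returns the summed rank degradation of candidates that fall out of the top fraction.
    """
    num_envs = len(r_goal)
    m = int(top_frac * num_envs)                       # e.g., 409 out of 4096

    # Initial ranking by the goal reward alone (rank 1 = best).
    order_goal = np.argsort(-r_goal)
    candidates = order_goal[:m]                        # top-10% candidate set
    init_rank = np.arange(1, m + 1)                    # ranks 1..m, aligned with `candidates`

    # Global re-ranking by the full composite reward.
    r_total = r_goal + r_opt
    global_rank = np.empty(num_envs, dtype=int)
    global_rank[np.argsort(-r_total)] = np.arange(1, num_envs + 1)
    final_rank = global_rank[candidates]

    # Penalize only the candidates that drop out of the top fraction.
    dropped = final_rank > m
    return float(np.sum((final_rank - init_rank) * dropped))
```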
The value of $J$ depends on the weighting parameters $w^{\text{p}}$ and $w^{\text{reg}}$ used in the lower-layer reward, i.e., $J = J(w^{\text{p}}, w^{\text{reg}})$. To ensure that the shaping reward improves policy quality without compromising task completion, we perform online optimization of these weights to minimize $J$. This optimization is carried out using a meta-heuristic algorithm; specifically, the Grey Wolf Optimizer (GWO) is employed in this work. The optimization problem is formulated as:

$$J^{*} = \min_{w^{\text{p}},\, w^{\text{reg}}} J(w^{\text{p}}, w^{\text{reg}}), \qquad (w^{\text{p}*}, w^{\text{reg}*}) = \arg\min_{w^{\text{p}},\, w^{\text{reg}}} J(w^{\text{p}}, w^{\text{reg}}),$$

where $J^{*}$ denotes the minimized value of the penalty objective, and $(w^{\text{p}*}, w^{\text{reg}*})$ are the optimal weight parameters that yield the best alignment between the optimizing reward $r_{\text{opt}}$ and the primary task goal. By solving this problem during training, the framework adaptively tunes the auxiliary reward weights in a goal-conditioned manner, ensuring that exploration is both effective and task-consistent.
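As a rough illustration of how this minimization could be carried out, the sketch below implements a generic, textbook-style Grey Wolf Optimizer and applies it to the rank-deviation objective; the bounds, population size, and iteration count are placeholders rather than the settings used in this work.

```python
import numpy as np

def gwo_minimize(objective, dim, bounds=(0.0, 1.0), n_wolves=20, n_iters=50, seed=0):
    """Minimal Grey Wolf Optimizer: minimizes `objective` over a box-constrained space."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    wolves = rng.uniform(lo, hi, size=(n_wolves, dim))        # initial population
    fitness = np.array([objective(w) for w in wolves])

    for t in range(n_iters):
        alpha, beta, delta = wolves[np.argsort(fitness)[:3]]  # three best wolves
        a = 2.0 - 2.0 * t / n_iters                           # exploration factor decays 2 -> 0

        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2.0 * a * r1 - a, 2.0 * r2
                D = np.abs(C * leader - wolves[i])
                new_pos += (leader - A * D) / 3.0             # average pull of the leaders
            wolves[i] = np.clip(new_pos, lo, hi)
            fitness[i] = objective(wolves[i])

    best = int(np.argmin(fitness))
    return wolves[best], float(fitness[best])

# Example wiring (r_goal_eps: per-episode goal rewards; opt_terms_eps: per-episode matrix
# of lower-layer reward terms, so opt_terms_eps @ w gives r_opt for a weight vector w):
# w_star, J_star = gwo_minimize(
#     lambda w: rank_deviation_penalty(r_goal_eps, opt_terms_eps @ w), dim=6)
```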
This process enables goal-conditioned reward shaping: the optimization of auxiliary rewards is explicitly conditioned on the agent first achieving a sufficient level of task proficiency, as measured by the upper-layer reward $r_{\text{goal}}$. As a result, exploration guided by $r_{\text{opt}}$ is not arbitrary but is restricted to policies that already satisfy the fundamental task objective, ensuring that refinement is built upon a foundation of task success.
Importantly, this framework effectively mitigates the issue of spurious rewards. For example, a policy that remains completely static may achieve a high $r_{\text{opt}}$ (e.g., due to minimal action-magnitude or joint-velocity penalties), but it would fail to track the reference motion, resulting in a low $r_{\text{goal}}$. Such a policy would not be included in the top 10% of episodes and would therefore receive a poor overall ranking. This mechanism inherently filters out degenerate or misleading solutions that exploit auxiliary rewards without contributing to actual task performance.
Finally, the optimal shaping reward $r_{\text{opt}}^{*}$, computed using the tuned weights $w^{\text{p}*}$ and $w^{\text{reg}*}$, is fed into the policy optimization algorithm, Proximal Policy Optimization (PPO) in this work, to update the agent's policy network. This establishes a closed-loop training pipeline: task performance filters candidate policies, auxiliary rewards are optimized based on goal rewards, and the resulting shaped reward guides policy improvement. The complete procedure is summarized in the pseudo-code provided in Algorithm 1.
        
Algorithm 1. The Pseudo Code of Two-Layered Reward Optimization via GWO

Input: the set of environment trajectories; the upper-layer goal reward $r_{\text{goal}}$; the lower-layer shaping reward $r_{\text{opt}}(w)$; the search bounds for the weights $w = (w^{\text{p}}, w^{\text{reg}})$; the GWO parameters (population size, maximum iterations, etc.)
Output: the optimal weight vector $w^{*} = (w^{\text{p}*}, w^{\text{reg}*})$
Begin:
[Initialize]
1: for each trajectory do
2:     compute $r_{\text{goal}}$
3: end for
4: sort all episodes in descending order of $r_{\text{goal}}$
5: select the top 10% (i.e., top 409 of 4096) as the candidate set
6: assign each candidate an initial rank $R_i^{\text{init}}$, where $R_i^{\text{init}} = 1$ indicates the best episode
[GWO optimization]
7: initialize the population of wolves (candidate weight vectors) randomly within the search bounds
8: assign initial values to GWO's controlling parameters
9: while the maximum number of iterations is not reached and convergence is not achieved do
10:     for each wolf $w$ do
11:         compute the composite reward $r = r_{\text{goal}} + r_{\text{opt}}(w)$ for all episodes
12:         rank all 4096 episodes by $r$ to obtain the global ranks $R_i^{\text{final}}$
13:         compute the objective value $J(w)$
14:     end for
15:     evaluate and rank the wolves by $J$
16:     update the three best solutions (alpha, beta, delta)
17:     update the controlling parameters based on the iteration count
18:     for each wolf do
19:         update its position using the GWO update rule
20:     end for
21: end while
22: return $w^{*}$ (best solution)
[Return and Apply Optimal Reward]
23: set $w = w^{*}$ and compute the final shaping rewards $r_{\text{opt}}^{*}$ for all environments using $w^{*}$
24: pass $r = r_{\text{goal}} + r_{\text{opt}}^{*}$ to PPO for policy-gradient computation and agent update
End Algorithm.
In summary, the proposed two-layered reward framework enhances the adaptability and expressiveness of reward design in reinforcement learning. By decoupling task completion from performance refinement and enforcing priority-based optimization, it promotes structured exploration and ensures stable learning, particularly beneficial for high-dimensional, dynamic humanoid control tasks.
  4. Results and Discussion
  4.1. Application to Unitree G1 Motion Tracking
To validate the effectiveness of the proposed two-layered reward reinforcement learning framework, we evaluate its performance in a high-fidelity simulation using the Unitree G1 humanoid robot within the Isaac Gym environment.
The objective is to enable the robot to accurately track a diverse set of human-generated motion trajectories—such as walking, running, jumping, and kicking—while maintaining dynamic stability and energy efficiency.
In this work, we focus on enabling the Unitree G1 humanoid robot to imitate a user-defined gymnastic motion sequence, recorded by the author and referred to here as the "gym-motion" for brevity, as illustrated in Figure 5. The top sub-figure shows keyframes from the original human motion capture video, while the bottom sub-figure presents the corresponding motion retargeted onto the Unitree G1 robot. The sequence consists of a series of coordinated movements: starting from a neutral stance, the subject steps to the right (1), shifts weight with arms swinging outward (2), extends both arms sideways at shoulder height (3), steps leftward (4), transitions into a forward-facing pose with hands raised overhead (5–6), then returns to the right side with arms lowered (7–8), completing the motion cycle.
This gym-motion involves dynamic lateral movement, upper-body coordination, and precise timing between limb and body actions, making it a challenging-yet-representative task for evaluating the robot’s ability to replicate agile and dexterous human-like behaviors.
Successfully reproducing such a complex and fluid motion in simulation serves as a rigorous test of the proposed two-layered reward framework’s ability to learn and generalize high-dexterity locomotion skills under physical constraints.
The simulation is configured with 4096 parallel environments, enabling large-scale training with high sample efficiency. Each environment runs at a control frequency of 50 Hz, with a simulation timestep of 0.005 s (i.e., four physical steps per control step). The robot is actuated via PD controllers on all 23 actuated joints. Joint position and velocity limits are enforced based on the real robot’s specifications.
The observation space includes the robot’s base pose (position and orientation in world frame), base linear and angular velocities, joint positions and velocities, last action, and the reference motion state (projected 3D joint positions and velocities at the current phase). The action space consists of target joint positions (PD setpoints) relative to the current configuration.
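For reference, the sketch below shows how such PD actuation is typically wired in an Isaac Gym-style pipeline; the gains, action scale, and example values are placeholders, not the settings used for the G1 in this work.

```python
import numpy as np

NUM_JOINTS = 23
KP, KD, ACTION_SCALE = 100.0, 2.0, 0.25   # placeholder PD gains and action scaling
CONTROL_DT, SIM_DT = 0.02, 0.005          # 50 Hz control, 4 physics steps per control step

def pd_torques(action, q, qd):
    """Map a policy action (joint-position offsets) to torques via PD control."""
    q_target = q + ACTION_SCALE * action   # PD setpoint relative to the current configuration
    return KP * (q_target - q) - KD * qd   # torque command applied at each physics step

# Example call with a zero action and a resting robot (illustration only).
tau = pd_torques(np.zeros(NUM_JOINTS), np.zeros(NUM_JOINTS), np.zeros(NUM_JOINTS))
```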
We employ the two-layered reward structure introduced in Section 3:
(1) The upper-layer goal reward $r_{\text{goal}}$ measures task completion accuracy and is defined as:

$$r_{\text{goal}} = w_{\text{pos}}\, r_{\text{pos}} + w_{\text{rot}}\, r_{\text{rot}} + w_{\text{vel}}\, r_{\text{vel}} + w_{\text{ang}}\, r_{\text{ang}},$$

where the component rewards are defined as follows:
(a) Body position tracking:

$$r_{\text{pos}} = \exp\left( -\sigma_{\text{pos}}\, \frac{1}{N_b} \sum_{k=1}^{N_b} \left\| p_k - \hat{p}_k \right\|^2 \right),$$

where $\left\| p_k - \hat{p}_k \right\|$ denotes the global position error of the $k$-th body link, $p_k$ is the simulated position and $\hat{p}_k$ is the reference position, $N_b$ is the number of tracked body links, and $\sigma_{\text{pos}}$ is the temperature coefficient that controls the sensitivity to position errors.
(b) Body orientation (rotation) tracking:

$$r_{\text{rot}} = \exp\left( -\sigma_{\text{rot}}\, \frac{1}{N_b} \sum_{k=1}^{N_b} \theta_k^2 \right).$$

Here, $\theta_k$ is the rotation angle (in radians) between the reference and simulated orientations of the $k$-th body link, represented as quaternions. This measures the angular discrepancy for each body link.
(c) Body linear velocity tracking:

$$r_{\text{vel}} = \exp\left( -\sigma_{\text{vel}}\, \frac{1}{N_b} \sum_{k=1}^{N_b} \left\| v_k - \hat{v}_k \right\|^2 \right).$$

Here, $\left\| v_k - \hat{v}_k \right\|$ is the linear velocity error between the simulated and reference values of the $k$-th body link.
(d) Body angular velocity tracking:

$$r_{\text{ang}} = \exp\left( -\sigma_{\text{ang}}\, \frac{1}{N_b} \sum_{k=1}^{N_b} \left\| \omega_k - \hat{\omega}_k \right\|^2 \right).$$

Here, $\left\| \omega_k - \hat{\omega}_k \right\|$ is the angular velocity error. The weights $w_{\text{pos}}$, $w_{\text{rot}}$, $w_{\text{vel}}$, and $w_{\text{ang}}$ balance the relative importance of each tracking component, and $\sigma_{\text{pos}}$, $\sigma_{\text{rot}}$, $\sigma_{\text{vel}}$, and $\sigma_{\text{ang}}$ are temperature coefficients that modulate the sharpness of the exponential penalty. Here, $r_{\text{goal}}$ is static: its weights and temperature coefficients are determined in advance.
(2) The lower-layer optimizing reward $r_{\text{opt}}$ promotes desirable behaviors such as stability, energy efficiency, and safety, and is composed of:

$$r_{\text{opt}} = \sum_{j} w_j^{\text{p}}\, r_j^{\text{p}} + \sum_{k} w_k^{\text{reg}}\, r_k^{\text{reg}},$$

where $r_j^{\text{p}}$ and $r_k^{\text{reg}}$ denote the individual penalty and regularization reward terms, respectively, designed to penalize undesirable behaviors and encourage smooth, natural motions. The weights $w_j^{\text{p}}$ and $w_k^{\text{reg}}$ are automatically adjusted online during training using the Grey Wolf Optimizer (GWO), allowing the system to dynamically balance competing objectives. These reward weights are constrained within a bounded interval to prevent unbounded growth and ensure stable reward shaping. Detailed definitions of $r_j^{\text{p}}$ and $r_k^{\text{reg}}$ are provided in Table 1.
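The exponential tracking kernels above can be implemented in a few lines; the snippet below assumes a mean-squared-error kernel and placeholder temperature values, so it is an illustrative sketch rather than the exact formulation tuned in [30].

```python
import numpy as np

def tracking_reward(sim, ref, sigma):
    """Exponential tracking kernel: 1.0 for a perfect match, decaying with squared error."""
    err = np.mean(np.sum((sim - ref) ** 2, axis=-1))   # mean squared error over body links
    return float(np.exp(-sigma * err))

# Hypothetical per-step quantities for N_b tracked body links (placeholders).
N_b = 13
sim_pos, ref_pos = np.random.randn(N_b, 3) * 0.01, np.zeros((N_b, 3))
sim_vel, ref_vel = np.random.randn(N_b, 3) * 0.05, np.zeros((N_b, 3))

r_pos = tracking_reward(sim_pos, ref_pos, sigma=100.0)
r_vel = tracking_reward(sim_vel, ref_vel, sigma=10.0)
# r_goal is then the fixed weighted sum of the four tracking terms (weights set in advance).
```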
The specific values of the coefficients in the reward functions are the same as those defined in [30]; readers are referred to that work for further details.
Training is performed using the Proximal Policy Optimization (PPO) algorithm. The policy is trained for 2500 iterations with a mini-batch size of 4096. Unless otherwise specified, all other PPO hyperparameters (including the learning rate and the GAE discount factor) and the physical parameters of the Unitree robot are set according to [30]. The parameters of the Grey Wolf Optimizer (GWO) are adopted from [32]. For the baseline method, a static reward function is used with all weight coefficients set to 1. All experiments are carried out on a machine with an NVIDIA GeForce RTX 4070 Ti GPU.
  4.2. Simulation Results
We first examine the average episode length over training epochs, as shown in Figure 6, which reflects the robot's ability to maintain balance and complete motion sequences without falling. The red curve represents the proposed two-layered reward framework, while the green dashed curve corresponds to a baseline method using a static reward.
Initially, the static reward method achieves a higher average episode length, reaching approximately 180 steps earlier than our approach. This is because the static reward prioritizes survival and stability, encouraging conservative behaviors such as standing still or minimal movement, which reduces the risk of termination due to falls. In contrast, the upper-layer reward in our framework incentivizes the robot to actively track the desired gym-motion trajectory, even at the cost of increased instability during early learning stages. As a result, it exhibits more exploratory behavior and experiences higher failure rates initially.
However, as training progresses beyond 500 epochs, the two-layered reward strategy surpasses the static reward in both maximum episode length and overall robustness. It consistently maintains an average episode length above 200 steps for over 1000 epochs, indicating that the agent has successfully learned a stable and dynamic locomotion policy capable of executing complex motions.
The result shown in Figure 6 demonstrates that the two-layered reward framework promotes valuable exploration in the early training stage by prioritizing motion tracking in the reward design, enabling more effective imitation behavior to emerge during the mid-training phase and ultimately achieving superior overall tracking performance.
Figure 7 presents the body tracking errors for both the upper and lower body links over training epochs, illustrating the performance of the two reward frameworks in imitating the target gym-motion.
As shown in Figure 7, both the upper-body and lower-body tracking errors exhibit similar trends across the two methods: initially, the static reward approach achieves lower tracking error, particularly during the early stages of training. This is because the static reward emphasizes stability and penalizes large deviations from the current state, leading to conservative behavior.
In contrast, the two-layered reward framework, by elevating the priority of motion tracking through its upper-layer objective, actively encourages the agent to explore movements that align with the reference trajectory. As a result, the initial tracking error is higher due to increased exploratory actions and imperfect coordination. However, as training progresses beyond 500 epochs, the two-layered reward consistently outperforms the static reward, achieving significantly lower tracking errors in both the upper and lower body segments. This improvement reflects the agent’s growing ability to learn complex, coordinated motions while maintaining balance.
The overall trends in Figure 7 align well with the episode length trend shown in Figure 6, further validating the effectiveness of our framework. While the two-layered reward sacrifices short-term stability for long-term imitation fidelity, it ultimately enables more accurate and dynamic reproduction of the human gymnastic motion, demonstrating its superiority in learning high-dexterity behaviors.
Figure 8 quantifies the relative performance gain of the two-layered reward framework over the static baseline. Although early training exhibits high variance due to active exploration, the method surpasses the baseline after 500 epochs and maintains a consistent advantage. The average relative improvements reach 7.58% for upper-body and 10.30% for lower-body tracking, confirming that our approach effectively enhances tracking ability across both body segments.
Figure 9, Figure 10, Figure 11 and Figure 12 illustrate the evolution of motion tracking performance for both the proposed two-layered reward framework and the static reward baseline across key training stages at 100, 500, 1000, and 2200 epochs. These snapshots provide a qualitative comparison of how each method learns to imitate the target gym-motion over time.
Figure 9 depicts the motion tracking performance at 100 epochs. The two-layered reward agent attempts dynamic movements despite early falls, reflecting its prioritization of motion imitation over conservative stability. This behavior is consistent with the higher exploration levels observed in Figure 6, Figure 7 and Figure 8.
As shown in Figure 10, both agents have learned to maintain an upright posture by 500 epochs. By 1000 epochs, as illustrated in Figure 11, the static reward agent exhibits motion lag and tends to perform movements in place, while the two-layered reward agent achieves timely and accurate tracking with only a small delay, demonstrating superior motion imitation capability.
As reported in Figure 12, at 2200 epochs the two-layered reward agent achieves highly accurate and synchronized motion tracking with minimal delay. In contrast, the static reward agent still exhibits slight motion lag, indicating a slower response to dynamic movements. This demonstrates the effectiveness and superiority of the proposed two-layered reward framework for humanoid robot motion imitation.
  4.3. Discussion
The simulation results in the previous subsection demonstrate that the proposed two-layered reward framework outperforms the static reward baseline in human motion imitation on the Unitree G1 robot. Quantitatively, it achieves average improvements of 7.58% and 10.30% in upper-body and lower-body tracking performance, respectively.
To enhance the credibility of our results, we conducted a statistical significance analysis by evaluating the proposed method and the baseline framework under five independent random seeds. The simulation results, shown in Figure 13, depict the mean performance as solid lines, with shaded regions representing ±1 standard deviation across runs. The results demonstrate that our method consistently outperforms the static reward baseline throughout training. In terms of average performance, the proposed two-layered reward framework achieves a 4.25% improvement in upper-body tracking accuracy and a 4.98% improvement in lower-body tracking accuracy, with both improvements statistically significant according to the corresponding p-values. This confirms the robustness and effectiveness of our two-layered reward framework.
Figure 14 shows the evolution of the reward weights during training, with weight0 to weight5 corresponding to the six lower-layer penalty and regularization terms. Four of the weights converge to around 0.9, indicating the high importance of the corresponding reward terms. Notably, one term attains both a high final weight and the highest mean value (0.676), suggesting that it is the most critical for the current motion tracking task.
 The key advantage stems from the hierarchical reward design: by prioritizing motion imitation in the upper layer, the framework explicitly encourages the agent to learn the target motion trajectory before optimizing for secondary objectives such as stability or energy efficiency. This structured learning paradigm leads to more meaningful exploration—sacrificing short-term success rates during early training for faster convergence in motion tracking.
Indeed, the initial phase exhibits higher failure rates due to aggressive exploration driven by the imitation objective. However, this trade-off proves beneficial, as the two-layered approach consistently surpasses the baseline by the middle of the training procedure, indicating accelerated learning and superior long-term performance. The improved sample efficiency suggests that the framework effectively guides the policy search toward task-relevant behaviors.
The proposed two-layered reinforcement learning framework shares a conceptual synergy with recent advances in learning-augmented planning, such as Diffusion Tree (DiTree) [33], which combines the global completeness of sampling-based planners (SBPs) with the efficiency of diffusion policies as informed samplers. While DiTree focuses on kinodynamic motion planning in geometrically complex environments, our work addresses a complementary challenge: adaptive policy learning under dynamic task constraints in continuous control. Looking forward, an exciting direction is to integrate these paradigms and use a two-layered reward agent as the local policy within a DiTree-style planner, where the learned reward adaptation mechanism could dynamically reshape trajectory preferences based on terrain, task goals, or energy constraints. Such a hybrid architecture would combine global safety guarantees from SBPs with local behavioral adaptability, paving the way toward truly autonomous humanoid navigation in unstructured environments.
Nonetheless, several limitations remain. First, the evaluated gym-motion is relatively simple and short in duration. For more complex or temporally extended motions, the initial performance instability may persist longer, potentially increasing training cost. Further research is needed to analyze and mitigate the negative impact of prolonged exploration phases, possibly through curriculum strategies.
Second, while the two-layered structure enhances learning effectiveness, online optimization of the reward weights introduces additional computational overhead. As depicted in Figure 15, the average wall-clock overhead of the GWO computation per PPO iteration is 3.72 ± 0.06 s. In comparison, each PPO update takes approximately 43 s on the same hardware. Thus, the additional overhead is less than 10%, which we consider acceptable given the observed performance gains. While the current implementation employs the Grey Wolf Optimizer (GWO) due to its simplicity and effectiveness in low-to-moderate dimensional spaces (≤25 dimensions), we acknowledge that traditional meta-heuristic algorithms may encounter scalability challenges when applied to significantly higher-dimensional reward weight spaces. In high-dimensional settings, the search space grows exponentially, increasing the risk of slow convergence and premature convergence to suboptimal solutions. To address this limitation, our framework can be extended with advanced dimensionality-aware techniques such as the Effective Dimension Extraction Mechanism (EDEM) [34], a mechanism designed to enhance meta-heuristic optimization in complex, high-dimensional problems by identifying and focusing search efforts on the most influential dimensions. Future work should also investigate meta-heuristic algorithms faster than GWO that are better suited to real-time reinforcement learning pipelines.
Third, the 10% performance threshold is empirically motivated by preliminary training observations. While a higher threshold could yield marginally better results, it would also increase computational overhead. We therefore adopt 10% as a practical trade-off between performance improvement and computational efficiency. A more rigorous analysis of this threshold's impact is left for future work.
Finally, although validation is currently limited to the Isaac Gym simulation environment, the results sufficiently demonstrate the algorithmic advantages of the proposed framework. Deployment on the physical Unitree G1 robot involves sim-to-real transfer challenges that are beyond the scope of this work's core contribution. In follow-up studies, we aim to develop a pipeline that bridges video-based motion capture to real-world humanoid robot execution based on the proposed two-layered reward framework.