Realization of Efficient Exploration by Self-Generating Evaluation Considering Curiosity and Fear Indices Based on Prediction Error
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript proposes a novel method for generating rewards in reinforcement learning as an extension of the Self-Generating Evaluation framework. The key proposal is the introduction of "curiosity" and "fear" indices, which are calculated based on both the magnitude of the prediction error and the change in the average prediction error over time. The stated goal is to create a more efficient exploration strategy.
While the core idea is interesting, the manuscript suffers from several major flaws. The work is not reproducible, the methodology is poorly explained with a lack of concrete examples, and the experimental validation is insufficient to support the paper's claims.
My primary concern is the complete lack of reproducibility. The authors describe simulation experiments in Section 4, which confirms that a software implementation exists. However, neither the source code, the specific environment configurations, nor the raw data are provided. Without public access to the implementation, it is impossible for the research community to verify the results, build upon this work, or accurately compare it to other methods. This severely limits the paper's potential contribution. For a paper to have value, it must provide well-documented code that allows readers to reproduce the experiments.
The paper is exceptionally difficult to follow due to its high level of abstraction and lack of simple, concrete examples.
The introduction and methodology (Sections 1-3) present a mathematical framework, but the concepts are never grounded.
Section 3, the core of the proposal, introduces complex-looking equations for "curiosity" (Eq. 14) and "fear" (Eq. 15) without a single numerical walkthrough.
The reader is left to guess how a specific sensor input, its corresponding prediction error, and the change in the average prediction error actually combine to produce these evaluation values.
A simple, step-by-step example is essential to make the method understandable and to clarify the motivation compared to other approaches.
The experimental validation in Section 4 is insufficient to support the paper's claims of superiority.
The authors compare their proposed method only against their own "previous method".
The claim that this new "curiosity and fear" formulation is a superior method for exploration is therefore unsubstantiated. The authors are making a claim without proof.
Without a robust comparison to well-established, state-of-the-art intrinsic reward methods, the results are unconvincing. The paper currently reads as a conceptual proposal with a limited internal comparison, rather than a validated new algorithm. The conceptual distinctions seem like a "play of words" without strong empirical evidence to back them up.
Comments on the Quality of English Language
The English language quality requires significant copyediting. The manuscript is full of grammatical errors (e.g., "There are two reason", "which is gived"), awkward phrasing (e.g., "reawards is inadequacy"), and typos (e.g., "Foward model", "it is large, it is large"). These errors frequently obscure the authors' intended meaning and make the paper difficult to read and assess.
Author Response
Thank you for your various important comments.
We have revised the paper based on your comments.
We will incorporate the elements that could not be fully addressed in this revision into our future research.
In particular, regarding the experiments, we will conduct additional ones.
The following are the revised contents.
Comments1:
The introduction and methodology (Sections 1-3) present a mathematical framework, but the concepts are never grounded.
Response1:
Thank you for pointing that out.
The content of this comment concerns an important point for understanding both the previous method and the proposed method.
Therefore, explanations of the design concepts behind Equations (1), (4), (14), and (15) were added in response to this comment.
The following is an additional explanation.
Additional explanation regarding Equation (1) added from line 146 to line 149:
This design was intended to prompt the robot to avoid damage from stimuli.
The sensor input values for the stimuli tended to be high when the robot was damaged by them.
Therefore, calculating low evaluation values for large mean input values can prompt the agent to avoid damage to the robot.
Additional explanation regarding Equation (4) added from line 164 to line 170:
This design was intended to prompt the robot to avoid unpredictable sensor inputs that were difficult for the agent to handle.
The prediction errors between the actual sensor input and the predicted sensor input tend to be large when the actual sensor input is unpredictable.
Therefore, calculating low evaluation values for large prediction errors can prompt the agent to avoid unpredictable sensor inputs.
The value of the constant Si determines how the evaluation value changes.
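As a hedged illustration of this design concept only (the exact form of Equation (4) is not reproduced in this response), the following sketch assumes a Gaussian-shaped evaluation in which a width constant plays the role of Si, controlling how quickly the evaluation falls as the prediction error grows; the function name and form are assumptions for illustration, not the authors' actual equation.

```python
import math

def prediction_error_evaluation(error, s_i):
    """Illustrative only: returns a high evaluation for small prediction
    errors and a low evaluation for large ones. A Gaussian in the error is
    assumed here; the constant s_i stands in for Si and determines how fast
    the evaluation value changes."""
    return math.exp(-(error / s_i) ** 2)
```

Under this assumed form, a larger s_i flattens the curve, so the same prediction error is penalized less severely.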
Additional explanation regarding Equation (14) added from line 275 to line 280:
Equation (14) is designed by combining two Gaussian functions.
The purpose of this is to calculate higher evaluation values as the values of |Di(nt)| and |∆Gi(nt)| approach a medium value, as shown in Figure 5.
Additionally, this enabled the calculation of a high or medium evaluation value when the values of |Di(nt)| and |∆Gi(nt)| were close to zero, and a low evaluation value when they were close to the maximum value.
Additional explanation regarding Equation (15) added from line 316 to line 320:
Equation (15) is designed by combining two Gaussian functions.
The purpose of this was to calculate lower evaluation values as the distance from the center increased, in accordance with Figure 5.
Additionally, this enables different evaluation calculation tendencies for when Di(nt) and ∆Gi(nt) have the same sign and when they have different signs.
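To make these stated design goals concrete, the following minimal sketch (not the authors' actual Equations (14) and (15); the function names, centers, widths, and normalization to [0, 1] are all illustrative assumptions) combines two Gaussian functions so that a curiosity-style evaluation peaks at medium values of |Di(nt)| and |∆Gi(nt)|, while a fear-style evaluation falls with distance from the center with a sign-dependent tendency.

```python
import math

def gaussian(x, center, width):
    # One-dimensional Gaussian bump used as a building block.
    return math.exp(-((x - center) / width) ** 2)

def curiosity_evaluation(d_abs, dg_abs, center=0.4, width=0.3):
    # Product of two Gaussians over |Di(nt)| and |dGi(nt)| (both assumed
    # normalized to [0, 1]): highest at a medium value, moderate near zero,
    # low near the maximum -- matching the stated design of Equation (14).
    return gaussian(d_abs, center, width) * gaussian(dg_abs, center, width)

def fear_evaluation(d, dg, width_same=0.5, width_diff=0.3):
    # Evaluation decreases with the distance of (Di(nt), dGi(nt)) from the
    # center (taken as the origin here); a narrower width when the signs
    # differ gives a different calculation tendency for same-sign versus
    # different-sign combinations, as described for Equation (15).
    width = width_same if d * dg >= 0 else width_diff
    return math.exp(-(d ** 2 + dg ** 2) / width ** 2)
```

With these assumed parameters, curiosity is maximal at (0.4, 0.4), lower at the origin, and lowest at (1.0, 1.0), while fear-side evaluation penalizes different-sign pairs more sharply than same-sign pairs at equal distance.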
Comments2:
Section 3, the core of the proposal, introduces complex-looking equations for "curiosity" (Eq. 14) and "fear" (Eq. 15) without a single numerical walkthrough.
The reader is left to guess how a specific sensor input, its corresponding prediction error, and the change in the average prediction error actually combine to produce these evaluation values.
Response2:
Thank you for pointing that out.
As you pointed out, it is difficult to understand the evaluation calculations in the proposed method based solely on the formulas and their concepts.
In particular, Equations (14) and (15) calculate the evaluation value based on two types of values, making it difficult to guess the evaluation value calculated in each case.
Therefore, specific examples regarding the calculation of the evaluation values in Equations (14) and (15) were provided.
Regarding Equation (14), a sample of the calculated evaluation values for curiosity was added as Table 1 between lines 280 and 281, and an explanation of Table 1 was added starting from line 281.
Regarding Equation (15), a sample of the calculated evaluation values for fear was added as Table 2 between lines 323 and 324, and an explanation of Table 2 was added starting from line 324.
Comments3:
Experimental validation in Section 4 is insufficient to support the paper's claims of superiority.
The authors compare their proposed method only against their own "previous method".
The claim that this new "curiosity and fear" formulation is a superior method for exploration is therefore unsubstantiated. The authors are making a claim without proof.
Response3:
Thank you for pointing out this important issue.
This point is important to prove the usefulness of the proposed method.
In this study, we focused on improving the efficiency of exploration relative to our previous method.
Therefore, the performance of the proposed method was not compared with that of conventional methods such as ICM and RND in this paper.
However, a comparison with conventional methods is important to demonstrate the usefulness of the proposed method.
In particular, the proposed method uses the term "curiosity" in the same way as ICM and RND, and has "fear" as an element of originality in this study.
The target of the study, the realization of efficient exploration, is also similar to that of the conventional methods.
Therefore, comparing the performance of the conventional methods and the proposed method is considered an important factor.
However, the results will not be ready in time.
Therefore, the following statement addressing this comment has been added as future work in lines 510 to 519:
In future work, experiments should be conducted to compare the performance of the proposed method with those of the conventional methods ICM and RND.
This study aimed to achieve more efficient exploration than previous methods by considering curiosity and fear.
Therefore, the proposed method was compared only with the previous method described in this study.
However, a comparison with the previous method alone was insufficient to fully demonstrate the exploration performance of the proposed method.
To further demonstrate the usefulness of the proposed method, it should be compared with ICM and RND, which are conventional methods that use curiosity, in standard reinforcement-learning benchmark environments.
As another direction for future work, curiosity as an immediate reaction to new stimuli is listed.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper introduces a novel method for generating intrinsic rewards in reinforcement learning by explicitly modeling "curiosity" and "fear." The goal is to improve an agent's exploration efficiency while simultaneously ensuring it avoids dangerous or unpredictable states. The core idea—differentiating between novelty from insufficient learning and inherent environmental unpredictability—is conceptually compelling and addresses a significant challenge in the field. However, the study's contributions are currently limited by a complex methodology that lacks sufficient justification and an empirical evaluation too narrow to validate the proposed method's effectiveness in a broader context.
Strengths:
- In a novel conceptual framework, this paper formalizes the separation of intrinsic motivation into two distinct indices: "curiosity" and "fear". The work moves beyond monolithic novelty-seeking bonuses by designing mechanisms that specifically promote or discourage exploration based on the nature of the prediction error.
- The research tackles the critical and practical problem of safe and efficient exploration. The paper's focus on creating agents that intelligently moderate their exploratory behavior is a valuable research direction.
- Analysing the change in prediction error over time, rather than just its instantaneous magnitude, is a sophisticated approach. This allows the system to distinguish between novel states simply because the agent has not learned them yet (which should be explored) and inherently stochastic or unpredictable states (which may be dangerous). This is a thoughtful contribution to the design of intrinsic reward signals.
Weaknesses:
- The proposed method introduces a significant number of new equations and hyperparameters to calculate the evaluation values for curiosity and fear. The specific mathematical formulations, particularly using logarithmic transformations and difference calculations, appear somewhat ad hoc. The paper does not provide a strong theoretical grounding or an ablation study to justify why these specific design choices are optimal or necessary, making the method's complexity a potential barrier to adoption and tuning.
- The empirical evaluation is conducted exclusively within a simple, two-dimensional, discrete grid-world environment. While useful for initial validation, such an environment does not represent the complex, high-dimensional, and continuous state spaces where exploration challenges are most pronounced. The results from this simplified setting may not generalize to more challenging and standard benchmark problems.
- The proposed method is only compared against a "prototype" and a previous iteration of the authors' own work. There is no comparison against well-established and widely used intrinsic motivation algorithms like the Intrinsic Curiosity Module (ICM) or Random Network Distillation (RND), which are considered standard baselines in the field. Without this comparison, it is impossible to properly situate the method's performance and assess its contribution to the existing body of research.
- The introduction features several instances of grouped references, making it difficult to understand the specific contribution of each cited work, and it suggests that some citations may not be essential to the core argument.
My Recommendations
- It is essential to evaluate the method on standard reinforcement learning benchmark environments (games or tasks) to demonstrate its value. This would provide a direct and fair comparison of its performance against other established methods.
- The authors should perform ablation studies to justify the model's complexity. This would involve systematically removing or altering components of the curiosity and fear calculations to measure their individual impact on performance.
- The manuscript would benefit from a clearer and more intuitive explanation of the rationale behind the chosen mathematical formulas.
- The current title is overly long. The subtitle should be integrated into the abstract or introduction.
- The text over-relies on the term "robot" throughout.
- Many expressions in the text lack a scientific tone.
- Grammatical errors such as on Line 88 "...generates a evaluation value..."
Author Response
Thank you for your various important comments.
We have revised the paper based on your comments.
We will incorporate the elements that could not be fully addressed in this revision into our future research.
In particular, regarding the experiments, we will conduct additional ones.
The following are the revised contents.
Comments1:
It is essential to evaluate the method on standard reinforcement learning benchmark environments (games or tasks) to demonstrate its value.
This would provide a direct and fair comparison of its performance against other established methods.
Response1:
Thank you for pointing out this important issue.
This point is important to prove the usefulness of the proposed method.
In this study, we focused on improving the efficiency of exploration relative to our previous method.
Therefore, the performance of the proposed method was not compared with that of conventional methods such as ICM and RND in this paper.
Additionally, regarding the experimental environment, we employed a custom-built environment that the previous method was unable to adapt to.
In particular, the proposed method uses the term "curiosity" in the same way as ICM and RND, and has "fear" as an element of originality in this study.
The target of the study, the realization of efficient exploration, is also similar to that of the conventional methods.
Therefore, comparing the performance of the conventional methods and the proposed method is considered an important factor.
However, the results will not be ready in time.
Therefore, the following statement addressing this comment has been added as future work in lines 510 to 519:
In future work, experiments should be conducted to compare the performance of the proposed method with those of the conventional methods ICM and RND.
This study aimed to achieve more efficient exploration than previous methods by considering curiosity and fear.
Therefore, the proposed method was compared only with the previous method described in this study.
However, a comparison with the previous method alone was insufficient to fully demonstrate the exploration performance of the proposed method.
To further demonstrate the usefulness of the proposed method, it should be compared with ICM and RND, which are conventional methods that use curiosity, in standard reinforcement-learning benchmark environments.
As another direction for future work, curiosity as an immediate reaction to new stimuli is listed.
Comments2:
The authors should perform ablation studies to justify the model's complexity.
This would involve systematically removing or altering components of the curiosity and fear calculations to measure their individual impact on performance.
Response2:
Thank you for pointing that out.
The content of this comment concerns an important point for understanding the proposed method.
Equations (14) and (15) in the proposed method appear complex.
In particular, Equations (14) and (15) calculate the evaluation value based on two types of values, making it difficult to guess the evaluation value calculated in each case.
Therefore, it is considered important to demonstrate the performance of each component within complex models.
However, the results of the ablation experiment will not be ready in time.
Therefore, although it is not an intuitive solution, specific examples regarding the calculation of the evaluation values in Equations (14) and (15) were provided.
Regarding Equation (14), a sample of the calculated evaluation values for curiosity was added as Table 1 between lines 280 and 281, and an explanation of Table 1 was added starting from line 281.
Regarding Equation (15), a sample of the calculated evaluation values for fear was added as Table 2 between lines 323 and 324, and an explanation of Table 2 was added starting from line 324.
Comments3:
The manuscript would benefit from a clearer and more intuitive explanation of the rationale behind the chosen mathematical formulas.
Response3:
Thank you for pointing that out.
The content of this comment concerns an important point for understanding both the previous method and the proposed method.
Therefore, explanations of the design concepts behind Equations (1), (4), (14), and (15) were added in response to this comment.
The following is an additional explanation.
Additional explanation regarding Equation (1) added from line 146 to line 149:
This design was intended to prompt the robot to avoid damage from stimuli.
The sensor input values for the stimuli tended to be high when the robot was damaged by them.
Therefore, calculating low evaluation values for large mean input values can prompt the agent to avoid damage to the robot.
Additional explanation regarding Equation (4) added from line 164 to line 170:
This design was intended to prompt the robot to avoid unpredictable sensor inputs that were difficult for the agent to handle.
The prediction errors between the actual sensor input and the predicted sensor input tend to be large when the actual sensor input is unpredictable.
Therefore, calculating low evaluation values for large prediction errors can prompt the agent to avoid unpredictable sensor inputs.
The value of the constant Si determines how the evaluation value changes.
Additional explanation regarding Equation (14) added from line 275 to line 280:
Equation (14) is designed by combining two Gaussian functions.
The purpose of this is to calculate higher evaluation values as the values of |Di(nt)| and |∆Gi(nt)| approach a medium value, as shown in Figure 5.
Additionally, this enabled the calculation of a high or medium evaluation value when the values of |Di(nt)| and |∆Gi(nt)| were close to zero, and a low evaluation value when they were close to the maximum value.
Additional explanation regarding Equation (15) added from line 316 to line 320:
Equation (15) is designed by combining two Gaussian functions.
The purpose of this was to calculate lower evaluation values as the distance from the center increased, in accordance with Figure 5.
Additionally, this enables different evaluation calculation tendencies for when Di(nt) and ∆Gi(nt) have the same sign and when they have different signs.
Comments4:
The current title is overly long. The subtitle should be integrated into the abstract or introduction.
Response4:
Thank you for pointing that out.
The previous title was too detailed, so we changed it.
The new title is "Realization of Efficient Exploration by Self-Generating Evaluation Considering Curiosity and Fear Indices Based on Prediction Error".
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
This is the second version of the manuscript, revised based on previous feedback. The authors have incorporated some additional explanations for equation designs, numerical examples in tables for curiosity and fear calculations, and a future work section on broader comparisons.
These changes partially improve clarity and address some abstraction issues.
However, the core concern of reproducibility remains entirely unaddressed. No source code, data, environment configurations, or hyperlinks to repositories (GitHub) have been provided, despite the paper relying on simulation experiments to demonstrate effectiveness.
In a computational field like reinforcement learning, this omission severely undermines the paper's scientific value, as results cannot be verified, replicated, or built upon. Other issues, such as weak experimental validation and English quality, persist.
The authors' response to previous comments does not mention reproducibility, focusing instead on explanations and future experiments. This suggests it was not prioritized in revisions.
Without public code (via GitHub), raw data, or an appendix with implementation details (hyperparameters, environment setups), the research community cannot independently verify the claims.
For RL methods, reproducibility is essential due to sensitivity to random seeds, architectures, and environments.
English Language Quality: Improvements are minimal; errors persist.
Awkward phrasing (e.g., "prompts eagerness to learn and explores behaviors") and repetitions obscure meaning.
Professional editing is required.
References: Still inappropriate; ICM and RND are cited but not empirically engaged.
Comments on the Quality of English Language
The English language quality requires editing.
Author Response
Thank you for commenting on our paper once again.
In response to these comments, we took the following actions.
Comment1:
Without public code (via GitHub), raw data, an appendix with implementation details (hyperparameters, environment setups), the research community cannot independently verify the claims.
For RL methods, reproducibility is essential due to sensitivity to random seeds, architectures, and environments.
Response1:
Thank you for pointing out this important issue.
We also think that reproducibility is important in research.
Therefore, we have made multiple revisions regarding this point.
Firstly, we have added a new figure showing a detailed overview of our SGE system.
This figure is located between lines 114 and 115 (see Figure 1).
We think this will make it easier for others to build systems similar to ours.
Secondly, we added further information regarding the experimental settings.
Detailed hyperparameter information for the experiment using the two-dimensional dynamic environment is provided in Table 7, immediately following Table 6.
We have also added the following additional explanation about the experimental environment between lines 447 and 448:
"This environment is surrounded by walls on all sides.
If the agent collides with a wall, the agent returns to the state one step prior."
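As a hedged sketch of the quoted collision rule (the environment's actual dimensions, action set, and dynamics are not stated in this response, so a square grid and four moves are assumed purely for illustration):

```python
def step(pos, action, size=10):
    """Assumed square grid of side `size` surrounded by walls. A move that
    would leave the grid counts as a wall collision, and the agent returns
    to the state one step prior (i.e., its position is unchanged)."""
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = moves[action]
    x, y = pos[0] + dx, pos[1] + dy
    if 0 <= x < size and 0 <= y < size:
        return (x, y)
    return pos  # wall collision: revert to the prior state
```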
Finally, we submitted the sensor data, evaluation data, and detailed experimental results obtained from the published experiments as a Supplementary File.
Addendum: The sensor data and evaluation value data were too large to upload.
Therefore, only the trial-by-trial reward and Q-value data related to reinforcement learning were uploaded.
Comment2:
Still inappropriate; ICM and RND are cited but not empirically engaged.
Response2:
Thank you for pointing that out.
In response to this comment, we have reduced the number of citations of papers on ICM and RND, as we have not conducted comparative experiments.
However, we wanted to introduce ICM and RND as related studies, so we retained citations to the original papers and recent research trends.
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you to the authors for their responsiveness in fixing and clarifying the issues raised in my previous review; however, the introduction still contains large, grouped citation blocks, making it impossible to understand the specific contribution of each reference.
Author Response
Comment1:
The introduction still contains large, grouped citation blocks, making it impossible to understand the specific contribution of each reference.
Response1:
Thank you for your comments regarding the references.
In response to this comment, we reduced the number of citations.
After the revision, citations of papers on ICM and RND were limited to the original research and recent research trends.
We also slightly reduced the number of citations in other grouped citation blocks, because many papers were cited for a single term.
However, several grouped citation blocks were retained to introduce various cases.

