Open Access Article

Peer-Review Record

Social Learning-Enhanced Deep Reinforcement Learning Through Behavioral Observation

Electronics 2026, 15(13), 2940; https://doi.org/10.3390/electronics15132940 (registering DOI)

by Mehmet Dincer Erbas * and Ceren Gulen

Reviewer 1: Anonymous

Reviewer 2: Wentao Zhang

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Submission received: 21 May 2026 / Revised: 30 June 2026 / Accepted: 1 July 2026 / Published: 5 July 2026

(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript presents a current and scientifically significant topic related to Social Learning-Enhanced Deep Reinforcement Learning (SLDRL) in multi-agent systems. The proposed approach combines a DRL layer with a social learning mechanism for storing, selecting and selectively using observed behavioral sequences. The article is well structured and includes a comparison with pure DRL and behavioral cloning, as well as experiments in grid-world and CartPole environments.
Among the main advantages of the manuscript are the clear motivation, the good description of the SLDRL architecture, the presence of several experimental scenarios and the use of statistical analysis. The presented results show that the proposed method has the potential to accelerate the learning process and improve the agent's performance compared to the baseline approaches. The work is interesting and can contribute to the development of reinforcement learning and adaptive multi-agent systems.
In order to further improve the manuscript before publication, I recommend that the authors pay attention to the following aspects:
1. The novelty of the proposed method could be distinguished more clearly from existing approaches such as imitation learning, behavioral cloning, learning from demonstrations and GAIL. It would be useful to add a more systematic comparison in the Related Work section, showing which features are specific to SLDRL, e.g. selective enactment, entropy-based reinforcement of observed behaviors, lack of access to the internal states of the demonstrating agent and a dynamic balance between social and individual learning.
2. The multi-agent aspect of the study could be clarified more categorically. Some of the experiments seem closer to a scenario with one learning agent and a demonstrating agent. The authors should clarify whether the proposed approach is intended for fully decentralized multi-agent learning, for teacher–learner scenarios or for observational learning in a shared environment.
3. It is recommended to expand the methodological justification for using entropy reduction as an internal reinforcement signal. The authors could explain more clearly why the reduction of action-selection entropy is taken as an indicator of useful learning, as well as how the risk of premature convergence to a suboptimal policy is avoided.
4. The CartPole experiment could be described more precisely. Although the environment has a continuous state space, the action space remains discrete. Therefore, it is good to clearly indicate in the manuscript that the method is validated in a continuous-state/discrete-action task, without creating the impression of full validation in continuous action-space control problems.
5. It would be useful to add additional discussion on the contribution of the individual components of SLDRL. A comparison with methods such as PPO, Actor-Critic, Double DQN, Dueling DQN or Rainbow DQN would provide a more convincing assessment of the contribution of the proposed social learning layer. If such comparisons are beyond the scope of the present study, this can be clearly stated as a limitation and a possible direction for future work.
6. The figures could be improved in terms of readability, resolution, font size, and clarity of legends. This is especially important for graphs representing learning curves and the frequency of imitate/enact actions.
7. The statistical analysis is a strong point of the paper. For even greater clarity, it would be good for the authors to indicate the number of independent experimental repetitions, the exact time intervals with statistically significant differences, whether the assumptions for the paired t-test were checked, and how the Benjamini–Hochberg correction for multiple comparisons over time was applied.
8. The computational cost of the Social Learning layer could be quantified more clearly. The authors indicate that the additional overhead is low, but there is no detailed comparison in terms of runtime, memory cost, or computational complexity. A comparison between DRL and SLDRL in terms of training time, memory usage, or additional operations would increase the practical relevance of the study.
9. The limitations of the study could be presented in more detail. The experiments were conducted in simulated environments with accurate observations and controlled transfer of behavioral sequences. In real robotic or distributed multiagent systems, factors such as observation noise, partial observability, asynchronous interactions, sensor uncertainty, and imperfect action correspondence can significantly affect the results.
10. The conclusion is generally supported by the presented results, but it is worth noting that the method has been validated in a sparse-reward grid-world environment and in the continuous-state/discrete-action CartPole problem, while its broader validation remains a subject of future research.

Overall, the manuscript has scientific value and is appropriate for the thematic scope of the journal. The proposed SLDRL approach is interesting and promising. After further clarification of the methodological aspects, improvement of the presentation of the results, and clearer formulation of the limitations, the article can be published. I recommend its acceptance after minor corrections.

Author Response

We sincerely thank the reviewers for their careful evaluation of our manuscript and for their constructive comments and suggestions. We appreciate the time and effort devoted to reviewing our work. The feedback provided has helped us improve the clarity, scope, and overall quality of the manuscript. In the revised version, we have carefully considered all comments and implemented corresponding changes throughout the manuscript. Detailed responses to each comment are provided below.

Comment 1: The novelty of the proposed method could be distinguished more clearly from existing approaches such as imitation learning, behavioral cloning, learning from demonstrations and GAIL. It would be useful to add a more systematic comparison in the Related Work section, showing which features are specific to SLDRL, e.g. selective enactment, entropy-based reinforcement of observed behaviors, lack of access to the internal states of the demonstrating agent and a dynamic balance between social and individual learning.

Response 1: We thank the reviewer for this valuable suggestion. We agree that the distinction between the proposed framework and existing imitation-based and socially guided learning approaches could be articulated more clearly. To address this point, we revised the Related Work section to provide a more systematic conceptual comparison between the proposed approach and existing methods, including behavioral cloning (BC), learning from demonstrations (LfD), inverse reinforcement learning (IRL), Generative Adversarial Imitation Learning (GAIL), and learning-from-observation approaches. In the revised manuscript, we clarified how these approaches differ in terms of the source of socially acquired information and the mechanisms through which such information is incorporated into learning. In particular, we emphasized that the proposed framework treats socially observed behaviors as reusable behavioral candidates whose usefulness is evaluated through subsequent interaction with the environment, rather than relying on direct reproduction of demonstrated behaviors, reward inference, or predefined demonstration utilization strategies. These modifications can be found at page 5, paragraph 2, line 201.

Comment 2: The multi-agent aspect of the study could be clarified more categorically. Some of the experiments seem closer to a scenario with one learning agent and a demonstrating agent. The authors should clarify whether the proposed approach is intended for fully decentralized multi-agent learning, for teacher–learner scenarios or for observational learning in a shared environment.

Response 2: We thank the reviewer for this important observation. We agree that the original manuscript could benefit from a clearer distinction regarding the intended scope of the proposed framework and the interpretation of its social-learning setting. To improve clarity, we revised the manuscript to more explicitly position the present study as an evaluation of observational social-learning scenarios rather than as a validation of fully decentralized multi-agent learning. Although the proposed mechanism remains conceptually extensible to broader multi-agent settings, the current experiments focus on settings in which a learning agent selectively observes and reuses behaviors acquired from interaction with other agents. To better align the manuscript scope with the experimental validation presented, we also revised the manuscript title and updated the keywords by replacing “multi-agent systems” with terminology that more accurately reflects the observational learning perspective adopted in this study. Additionally, we revised the Conclusion section to clarify that extensions toward larger multi-agent environments, decentralized learning scenarios, and partially observable settings remain directions for future investigation rather than claims validated in the current work. These modifications can be found in the title, in the keywords, and at page 28, paragraph 1, line 1048.

Comment 3: It is recommended to expand the methodological justification for using entropy reduction as an internal reinforcement signal. The authors could explain more clearly why the reduction of action-selection entropy is taken as an indicator of useful learning, as well as how the risk of premature convergence to a suboptimal policy is avoided.

Response 3: We thank the reviewer for this valuable suggestion. To further evaluate the contribution of the individual mechanisms introduced in the proposed SLDRL framework, we performed an additional component ablation analysis and included the results in Appendix B.4. In this analysis, we evaluated two ablated variants of SLDRL while preserving the same DQN architecture, training procedure, and experimental setup used in the main discrete-state experiments. In the first variant, the state similarity matching mechanism was removed, allowing socially acquired behaviors to be enacted without contextual filtering. In the second variant, the entropy-based reinforcement mechanism was removed, preventing adaptive adjustment of behavior reuse probabilities based on entropy reduction.

The results show that removing either component leads to a statistically significant reduction in learning performance relative to the complete SLDRL configuration. The reduction is more pronounced when entropy-based reinforcement is removed, indicating that this mechanism plays an important role in the adaptive selection and reuse of socially acquired behaviors. At the same time, removing state similarity matching also results in a statistically significant performance decrease, demonstrating that contextual alignment contributes meaningfully to the effective utilization of observed behaviors. Since SLDRL is designed as a social learning enhancement layer operating on top of DQN rather than a standalone replacement for reinforcement learning, removing individual social-learning components is not expected to prevent learning entirely. Instead, the purpose of the ablation analysis is to quantify the contribution of each mechanism to learning efficiency. The complete SLDRL configuration consistently achieved the best performance across training, supporting the complementary contribution of both mechanisms. These modifications can be found at page 10, paragraph 1, line 403 and Appendix B.4.
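For readers unfamiliar with the mechanism discussed in Comment 3, the idea of using entropy reduction as an internal signal can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax over Q-values, the function names, and the linear update rule with its `step` parameter are all assumptions introduced here, since this record does not give the manuscript's formulas.

```python
import math


def softmax(q_values, temperature=1.0):
    """Turn Q-values into an action-selection distribution (assumed form)."""
    top = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def action_entropy(probs):
    """Shannon entropy (in nats) of an action-selection distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def update_reuse_probability(reuse_prob, entropy_before, entropy_after, step=0.1):
    """Hypothetical rule: raise the reuse probability of an observed behavior
    when enacting it reduced action-selection entropy, lower it otherwise.
    The result is clipped to the interval [0, 1]."""
    reduction = entropy_before - entropy_after
    return min(1.0, max(0.0, reuse_prob + step * reduction))


if __name__ == "__main__":
    # Made-up Q-values before and after enacting an observed behavior.
    before = action_entropy(softmax([0.1, 0.2, 0.1, 0.2]))
    after = action_entropy(softmax([0.1, 1.5, 0.1, 0.2]))
    print(update_reuse_probability(0.5, before, after))
```

In this sketch a more peaked action distribution after enactment nudges the reuse probability upward. It does not by itself guard against premature convergence, which is the risk the reviewer raises.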

Comment 4: The CartPole experiment could be described more precisely. Although the environment has a continuous state space, the action space remains discrete. Therefore, it is good to clearly indicate in the manuscript that the method is validated in a continuous-state/discrete-action task, without creating the impression of full validation in continuous action-space control problems.

Response 4: We thank the reviewer for this valuable observation. We agree that the original wording could create ambiguity regarding the interpretation of the CartPole experiment and unintentionally suggest validation in continuous action-space control settings. To improve clarity, we revised the manuscript to explicitly distinguish between continuous-state and continuous action-space environments. In particular, we clarified in the description of the CartPole experiments that the environment used in this study represents a continuous-state/discrete-action setting. We further emphasized that the challenge addressed in this experiment arises from state similarity and behavioral reuse under continuous state representations rather than from continuous action control. To maintain consistency throughout the manuscript, we also revised the relevant descriptions in the Abstract, the experimental section describing the CartPole environment, and the Conclusion section to avoid creating the impression of full validation in continuous action-space problems. These modifications can be found in the Abstract and at page 22, paragraph 1, line 754; page 21, paragraph 2, lines 763 and 777; page 22, paragraph 1, line 789; and page 27, paragraph 2, line 965.

Comment 5: It would be useful to add additional discussion on the contribution of the individual components of SLDRL. A comparison with methods such as PPO, Actor-Critic, Double DQN, Dueling DQN or Rainbow DQN would provide a more convincing assessment of the contribution of the proposed social learning layer. If such comparisons are beyond the scope of the present study, this can be clearly stated as a limitation and a possible direction for future work.

Response 5: We thank the reviewer for this valuable suggestion. We agree that evaluating the proposed social learning layer together with additional reinforcement learning architectures could provide a broader assessment of its contribution and generality.

However, conducting comprehensive experimental comparisons across multiple reinforcement learning algorithms would substantially extend the scope of the present study, whose primary objective is to introduce and validate the proposed social learning mechanism under controlled conditions using a DQN-based implementation. To address this point, we revised the Conclusion section to more clearly explain the rationale for focusing on DQN and to emphasize that the objective of the present work was to evaluate the contribution of the proposed social learning layer rather than to establish superiority over alternative reinforcement learning architectures. We additionally clarified that broader validation across methods such as PPO, Actor–Critic, Double DQN, Dueling DQN, and Rainbow DQN remains an important direction for future research. These modifications can be found at page 27, paragraph 5, line 997 and page 28, paragraph 2, line 1015.

Comment 6: The figures could be improved in terms of readability, resolution, font size, and clarity of legends. This is especially important for graphs representing learning curves and the frequency of imitate/enact actions.

Response 6: We thank the reviewer for this helpful suggestion.

In the revised manuscript, all figures were regenerated and updated to improve resolution, readability, and consistency.

Comment 7: The statistical analysis is a strong point of the paper. For even greater clarity, it would be good for the authors to indicate the number of independent experimental repetitions, the exact time intervals with statistically significant differences, whether the assumptions for the paired t-test were checked, and how the Benjamini–Hochberg correction for multiple comparisons over time was applied.

Response 7: We thank the reviewer for this suggestion. We agree that additional details would improve the transparency and reproducibility of the statistical analysis.

In the revised manuscript, we expanded the description of the statistical evaluation procedure in the experimental methodology section. Specifically, we explicitly reported the number of independent experimental repetitions used in each experiment (n = 120 for discrete-state experiments and n = 100 for continuous-state experiments unless otherwise stated). We also clarified how statistical significance was evaluated across training. Pairwise comparisons were performed independently at each evaluation point using paired t-tests, and the suitability of the parametric analysis was assessed through inspection of paired differences. To improve robustness against possible deviations from normality assumptions, all findings were additionally verified using the Wilcoxon signed-rank test. Furthermore, we expanded the description of the Benjamini–Hochberg (BH) correction procedure and clarified that the correction was applied separately for each family of temporal comparisons using a false discovery rate threshold of 0.05. To improve interpretability, the revised Results section now explicitly reports representative temporal intervals in which statistically significant differences were observed. The reported significant intervals remained unchanged after applying the BH correction. These modifications can be found at page 11, paragraph 3, line 451, page 12, paragraph 3, line 501 and page 14, paragraph 3, line 559.
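The Benjamini–Hochberg step-up procedure described in Response 7 can be sketched as below. This is a generic, standard-library illustration under stated assumptions, not the authors' analysis code: the function name `benjamini_hochberg` and the example p-values are hypothetical, and in practice the p-values for one family of temporal comparisons would come from the paired t-tests (or Wilcoxon signed-rank tests) the authors describe.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per input p-value: True where the null hypothesis
    is rejected while controlling the false discovery rate at `fdr`."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_rank = rank

    # Reject every hypothesis whose rank is at or below that largest rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, one per evaluation point of a single comparison family.
    example_p = [0.01, 0.04, 0.03, 0.5]
    print(benjamini_hochberg(example_p, fdr=0.05))
```

Applying the correction separately to each family, as the response states, would mean calling the function once per family of evaluation points rather than pooling all p-values together.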

Comment 8: The computational cost of the Social Learning layer could be quantified more clearly. The authors indicate that the additional overhead is low, but there is no detailed comparison in terms of runtime, memory cost, or computational complexity. A comparison between DRL and SLDRL in terms of training time, memory usage, or additional operations would increase the practical relevance of the study.

Response 8: We thank the reviewer for this suggestion and agree that clarifying the practical computational implications of the proposed social learning layer improves the applicability and interpretability of the framework. In the revised manuscript, we added a dedicated computational overhead analysis to the Conclusion section and reported a representative runtime comparison between pure DRL and SLDRL. To minimize confounding effects introduced by different interaction scenarios, the analysis was conducted using the expert trajectory setting and measured over 10 independent repetitions under identical hardware and software conditions. The results showed that the average training time increased from 205 ± 10 s for the pure DRL baseline to 210 ± 12 s for SLDRL, corresponding to an approximate overhead of only 2.4%. We additionally clarified that no noticeable increase in memory consumption was observed during representative runs.

To improve methodological clarity, we also emphasized that these runtime measurements should be interpreted as implementation-level training overhead rather than isolated algorithmic complexity estimates, since wall-clock time includes both simulation execution and learning updates. Furthermore, we clarified that the proposed social learning layer does not modify the underlying DRL optimization process and does not introduce additional trainable neural networks; instead, the additional computation originates from bounded behavioral-sequence storage, state similarity evaluation, and selective enactment operations. These modifications can be found at page 28, paragraph 3, line 1026.
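The overhead quoted in Response 8 follows directly from the two mean training times reported there; the short check below reproduces the arithmetic. The function name `relative_overhead` is an illustrative choice, and only the mean values are used, not the reported spreads.

```python
def relative_overhead(baseline_seconds, enhanced_seconds):
    """Percentage increase of the enhanced runtime over the baseline runtime."""
    return (enhanced_seconds - baseline_seconds) / baseline_seconds * 100.0


if __name__ == "__main__":
    # Mean training times quoted in the response: 205 s (pure DRL), 210 s (SLDRL).
    overhead = relative_overhead(205.0, 210.0)
    print(f"{overhead:.1f}%")  # prints 2.4%
```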

Comment 9: The limitations of the study could be presented in more detail. The experiments were conducted in simulated environments with accurate observations and controlled transfer of behavioral sequences. In real robotic or distributed multiagent systems, factors such as observation noise, partial observability, asynchronous interactions, sensor uncertainty, and imperfect action correspondence can significantly affect the results.

Response 9: Thank you for this valuable comment. We agree that the limitations of evaluating SLDRL under idealized simulation assumptions were not sufficiently emphasized in the original manuscript.

To address this concern, we revised the Conclusion section to clarify that the present study was intentionally conducted in controlled simulation environments with accurate observations and deterministic transfer of behavioral sequences, and that the reported findings should therefore be interpreted as an initial validation of the proposed framework rather than evidence of applicability to real robotic systems.

We additionally expanded the discussion of future research directions to explicitly acknowledge that validation in physical robotic platforms with imperfect sensing and actuation remains necessary. To further address this limitation, we also included an additional robustness analysis (Appendix B.5) in which artificially perturbed behavioral demonstrations were introduced during imitation. The results provide a preliminary assessment of the framework under imperfect behavioral information while remaining distinct from real-world robotic validation. These modifications can be found at page 10, paragraph 1, line 403, page 29, paragraph 3, line 1075 and Appendix B.5.
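Appendix B.5 is not reproduced in this record, so the exact perturbation scheme is not known here. One simple way to perturb a demonstrated action sequence, shown purely as an assumption, is to replace each action independently with a uniformly random action with some probability. The function name `perturb_sequence` and its parameters are hypothetical.

```python
import random


def perturb_sequence(actions, n_actions, noise_prob, rng=None):
    """Return a copy of `actions` in which each entry is, with probability
    `noise_prob`, replaced by an action drawn uniformly from range(n_actions).
    The random draw may coincide with the original action."""
    rng = rng or random.Random()
    noisy = []
    for action in actions:
        if rng.random() < noise_prob:
            noisy.append(rng.randrange(n_actions))
        else:
            noisy.append(action)
    return noisy


if __name__ == "__main__":
    # Made-up demonstration with four discrete actions (0..3).
    demo = [0, 1, 1, 2, 3, 3, 0]
    print(perturb_sequence(demo, n_actions=4, noise_prob=0.2, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the perturbation reproducible across independent repetitions.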

Comment 10: The conclusion is generally supported by the presented results, but it is worth noting that the method has been validated in a sparse-reward grid-world environment and in the continuous-state/discrete-action CartPole problem, while its broader validation remains a subject of future research.

Response 10: We thank the reviewer for this helpful observation and agree that the conclusions should remain closely aligned with the scope of the experimental validation. In the revised manuscript, we refined the final statements in the Conclusion section to more explicitly reflect the environments evaluated in this study. Specifically, we replaced broad references to “discrete and continuous environments” with a more precise description indicating that the proposed framework was validated in a sparse-reward discrete environment and a continuous-state/discrete-action CartPole benchmark. We additionally moderated broader generalization claims and clarified that, although the obtained results demonstrate the effectiveness of SLDRL across different state-space structures and reward settings considered in this work, broader validation across more complex environments, alternative reinforcement learning algorithms, and real-world applications remains an important direction for future research. These modifications can be found at page 29, paragraph 2, line 1062.

Reviewer 2 Report

Comments and Suggestions for Authors

(1) The experiments are conducted only on two relatively simple environments (a 10×10 grid foraging task and CartPole), so the authors should add higher-dimensional or visual-input environments to substantiate the scalability claims made throughout the paper.

(2) The underlying reinforcement learning algorithm is limited to DQN, yet the paper repeatedly argues the framework generalizes to PPO, SAC, DDPG, and others, so at least one additional algorithm should be tested to empirically support this generality claim.

(3) The statistical analysis reports only significance intervals and p-values but omits effect sizes and a quantitative comparison of final converged performance, which should be added to more completely characterize the method's advantages.

(4) Behavioral cloning is the only imitation-learning baseline used, while the stronger methods mentioned in the related work (such as GAIL and DIRL) are absent, so at least one modern imitation-learning baseline should be added to strengthen the comparison.

(5) The "perfect imitation" assumption (exact, noise-free copying of action sequences) is unrealistic in practical settings, so the authors should add robustness experiments with observation noise or action perturbations.

Author Response

Comment 1: The experiments are conducted only on two relatively simple environments (a 10×10 grid foraging task and CartPole), so the authors should add higher-dimensional or visual-input environments to substantiate the scalability claims made throughout the paper.

Response 1: We thank the reviewer for this valuable comment and agree that evaluation in higher-dimensional or visual-input environments would provide additional evidence regarding scalability and broader applicability.

The primary objective of the present study, however, was not to establish comprehensive scalability across complex reinforcement learning domains, but rather to introduce and analyze the proposed social learning mechanism in controlled and interpretable settings while isolating its contribution from other sources of algorithmic complexity. For this purpose, we intentionally selected two complementary benchmark problems representing different learning characteristics: (i) a sparse-reward discrete-state environment (grid-based foraging) and (ii) a continuous-state/discrete-action control environment (CartPole). These environments were chosen to evaluate whether the proposed mechanism remains beneficial across distinct reward structures and state-space properties rather than to demonstrate scalability to high-dimensional perception tasks.

We agree that the original wording in several parts of the manuscript may have suggested broader conclusions than those directly supported by the current experimental scope. Accordingly, we revised the manuscript to clarify the intended scope of the study and avoid overly broad scalability claims.

In the continuous-state experiment section, we revised the framing to emphasize that the CartPole experiments provide an initial validation beyond sparse-reward discrete environments rather than a comprehensive demonstration of generality and robustness.
In the Conclusion section, we added an explicit limitation statement clarifying that the current evaluation remains restricted to relatively low-dimensional benchmark environments and that the reported findings should be interpreted as an initial validation of the proposed social learning mechanism.
We also expanded the Future Work discussion to explicitly state that broader conclusions regarding scalability require evaluation in higher-dimensional, visual-input, and more complex control environments.

We believe these revisions better align the manuscript's claims with the experimental evidence currently provided. These modifications can be found at page 22, paragraph 2, line 754 and page 29, paragraph 2, line 1062.

Comment 2: The underlying reinforcement learning algorithm is limited to DQN, yet the paper repeatedly argues the framework generalizes to PPO, SAC, DDPG, and others, so at least one additional algorithm should be tested to empirically support this generality claim.

Response 2: We thank the reviewer for this important comment and agree that empirical evaluation with additional reinforcement learning algorithms would provide stronger evidence regarding the broader applicability of the proposed framework.

The objective of the present study, however, was not to establish empirical superiority or transferability across multiple reinforcement learning algorithms, but rather to introduce and analyze the proposed social learning mechanism under controlled conditions while isolating its contribution from changes in the underlying learning architecture. For this reason, DQN was intentionally selected as the reinforcement learning backbone because it provides a stable, interpretable, and consistently applicable framework across both the discrete-state and continuous-state experimental settings considered in this study.

We agree that some statements in the previous version of the manuscript may have suggested a broader level of algorithmic generalization than is directly supported by the current experimental evidence. Accordingly, we revised the manuscript to better align the scope of our claims with the conducted experiments.

We revised wording that could be interpreted as experimentally validated algorithmic generality and replaced it with more conservative language emphasizing modularity and conceptual compatibility.
We clarified that the proposed social learning layer was evaluated only in conjunction with DQN in the present study and that DQN was intentionally selected to isolate the contribution of the proposed mechanism.
We revised the Conclusion section to explicitly state that broader empirical validation across alternative reinforcement learning architectures remains future work rather than a demonstrated result of the current study.

Our intention is not to claim that the proposed framework has already been validated across algorithms such as PPO, SAC, DDPG, or other actor–critic methods. Rather, the contribution of this work is to demonstrate that adaptive social learning can function as a reinforcement learning enhancement mechanism within a controlled DQN-based setting and to provide a framework that may facilitate future integration with alternative reinforcement learning architectures. These modifications can be found at page 6, paragraph 4, line 272 and page 27, paragraph 5, line 997 and page 28, paragraph 5, line 1048.

Comment 3: The statistical analysis reports only significance intervals and p-values but omits effect sizes and a quantitative comparison of final converged performance, which should be added to more completely characterize the method's advantages.

Response 3: We thank the reviewer for this valuable observation and agree that statistical interpretation should be aligned with the intended performance criteria of the evaluated tasks.

In the present study, the statistical analysis was designed primarily to evaluate learning dynamics and learning efficiency rather than only final asymptotic performance. Accordingly, we reported confidence intervals across 100 independent runs together with paired t-tests, Wilcoxon signed-rank validation, and Benjamini–Hochberg correction to provide statistical support for the observed performance differences. Regarding final converged performance, we would like to clarify that this metric is not equally informative across all experimental settings considered in this study. In several discrete-state scenarios, the compared methods frequently converge to similar final solutions, while the primary contribution of SLDRL appears in faster policy development and earlier performance improvement during training. Therefore, comparing only final converged outcomes could underestimate the practical contribution of the proposed social learning mechanism.

To clarify this point, we added a brief statement in the revised manuscript indicating that, for the discrete-state experiments, performance differences should be interpreted primarily through learning efficiency and training dynamics rather than asymptotic final outcomes. These modifications can be found at page 13, paragraph 1, line 509.
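The response above does not report effect sizes. For reference, the kind of paired effect size the reviewer's comment refers to could be computed as in the sketch below, which uses the standardized mean of paired differences (often written d_z). This is an illustration only: the function name `paired_effect_size` and the sample values are hypothetical and are not results from the manuscript.

```python
import statistics


def paired_effect_size(scores_a, scores_b):
    """Standardized mean difference for paired samples: the mean of the
    per-run differences divided by their sample standard deviation."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)  # needs at least two pairs
    if spread == 0.0:
        raise ValueError("effect size is undefined when all differences are equal")
    return statistics.mean(diffs) / spread


if __name__ == "__main__":
    # Made-up per-run scores for two methods at one evaluation point.
    method_a = [2.0, 4.0, 6.0, 8.0]
    method_b = [1.0, 2.0, 3.0, 4.0]
    print(paired_effect_size(method_a, method_b))
```

Reporting such a value alongside each significant interval would indicate how large a difference is, which a p-value alone does not convey.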

Comment 4: Behavioral cloning is the only imitation-learning baseline used, while the stronger methods mentioned in the related work (such as GAIL and DIRL) are absent, so at least one modern imitation-learning baseline should be added to strengthen the comparison.

Response 4: We thank the reviewer for this valuable suggestion and agree that comparisons with additional imitation-learning approaches may provide further insight into the characteristics of the proposed framework.

However, the objective of the present study was not to benchmark SLDRL against the full spectrum of imitation-learning methods, but rather to isolate and evaluate the contribution of adaptive social learning under a controlled reinforcement learning setting. For this reason, Behavioral Cloning (BC) was selected as the primary imitation-learning baseline because it provides the most direct comparison against demonstration-driven behavior transfer while preserving assumptions that are closely aligned with the proposed framework. Methods such as GAIL and DIRL rely on substantially different learning assumptions and optimization objectives compared to SLDRL. In particular, these approaches introduce additional mechanisms such as reward inference, adversarial optimization, expert discrimination, or explicit policy alignment, which extend beyond the scope of evaluating selective behavioral reuse through ongoing reinforcement learning. As a result, incorporating such methods would require additional methodological considerations to ensure that comparisons remain meaningful and methodologically fair, rather than simply implementing additional baselines.

To clarify this distinction, we revised the manuscript to better explain the rationale behind selecting BC as the imitation-learning baseline and added discussion noting that comparisons with more advanced imitation-learning approaches remain an important direction for future work. Accordingly, we added a statement in the Conclusion section indicating that future studies may evaluate the proposed framework against adversarial and reward-inference-based imitation-learning methods, including GAIL and DIRL, under a broader imitation-learning evaluation setting. These modifications can be found at page 5, paragraph 2, line 201 and page 28, paragraph 5, line 1048.

Comment 5: The "perfect imitation" assumption (exact, noise-free copying of action sequences) is unrealistic in practical settings, so the authors should add robustness experiments with observation noise or action perturbations.

Response 5: Thank you for this valuable comment. We agree that the assumption of perfect imitation adopted in the original simulation setup represents an idealized condition and does not fully reflect the challenges encountered in practical robotic or distributed learning environments.

To address this concern, we conducted an additional robustness analysis and included the results in Appendix B.5. In the new experiments, artificially perturbed behavioral demonstrations were introduced during imitation in order to simulate imperfect behavioral transfer. Specifically, random action perturbations with different noise levels were applied to the observed behavioral sequences before they were stored and later enacted by the learner agent, while all other training conditions were kept unchanged. The results show that introducing perturbations increases variability and reduces the effectiveness of social learning, particularly at higher noise levels. Nevertheless, the proposed SLDRL framework continued to demonstrate learning advantages over the pure DRL baseline under moderate levels of perturbation, suggesting that the selective social learning mechanism is not entirely dependent on perfectly transferred demonstrations.

At the same time, we explicitly emphasize in the revised manuscript that these experiments represent only a simplified approximation of noisy imitation and should not be interpreted as validation under real-world robotic conditions. Validation under physical robotic settings with imperfect sensing and actuation remains an important direction for future work and is now discussed more explicitly in the revised Conclusion section. These modifications can be found at page 10, paragraph 1, line 403 and Appendix B.5.
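One way such action perturbations could be applied to an observed sequence is sketched below. The exact perturbation scheme used in Appendix B.5 is not given in this response, so the replacement rule here (each action is swapped for a uniformly random action with probability `noise_level`) is an assumption, as are the names `perturb_actions`, `noise_level`, and `n_actions`.

```python
import random


def perturb_actions(actions, noise_level, n_actions, rng):
    """Return a copy of `actions` where each entry is replaced, with probability
    `noise_level`, by an action index drawn uniformly from range(n_actions)."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    noisy = []
    for action in actions:
        if rng.random() < noise_level:
            # The random draw may coincide with the original action.
            noisy.append(rng.randrange(n_actions))
        else:
            noisy.append(action)
    return noisy


rng = random.Random(0)
observed = [0, 1, 2, 3, 0, 1]          # hypothetical observed action indices
clean_copy = perturb_actions(observed, 0.0, 4, rng)
noisy_copy = perturb_actions(observed, 0.3, 4, rng)
```

Under this assumed scheme, a noise level of 0 reproduces the perfect-imitation setting, and higher levels degrade the stored sequence before it is enacted.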

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for submitting your manuscript. While the idea of integrating social learning into deep reinforcement learning is interesting, the paper in its current form has several fundamental flaws that prevent acceptance. Below I summarise the key issues that must be addressed in a major revision, should you wish to resubmit.

Misalignment between claim and experiments. The title and abstract repeatedly refer to “Multi‑Agent Systems”, yet your experiments are predominantly single‑agent (CartPole) or involve at most two non‑interacting agents. The foraging task uses only one learning agent at a time; the other “agents” are pre‑defined trajectories or pre‑trained demonstrators without co‑learning or genuine interaction. This invalidates your core claim. You must either conduct proper multi‑agent experiments on standard benchmarks (e.g., SMAC, MPE, Overcooked with 3+ simultaneously learning agents) or revise the title and claims accordingly.
Overly simplistic environments. The 10×10 grid and CartPole are low‑dimensional, deterministic, fully observable toy problems. They lack partial observability, stochasticity, high‑dimensional observations, or multiple interacting agents. Such environments provide weak evidence for any claimed robustness or scalability. Replace them with challenging multi‑agent benchmarks that include these realistic complexities.
Poor technical presentation. Mathematical notation is incorrect (e.g., “Q_i−”, “Q − greedy” – use and -greedy). Equations lack punctuation and have missing parentheses. All figures (1–22) are low resolution, blurry, with illegible text and missing legends. Figure 4 (pseudocode) must be an algorithmic block or table, not an image. Rewrite the entire methodology with formally defined components, correct notation, and high‑quality vector graphics. Provide a proper algorithmic listing.
Lack of meaningful baselines and missing literature. You compare only pure DRL and behavioural cloning. Recent (2024–2025) MARL methods (MAPPO, QPLEX, MADDPG) and modern imitation learning baselines (GAIL, DAGGER, SQIL) are absent. No ablation study separates the entropy mechanism, imitate action, and enact action. Update the literature review and include these baselines. Add ablation experiments to validate each component.
Unclear definitions and insufficient validation. The “enactment mechanism” is not formally defined (is it exact replay or added experience?). The entropy‑based reinforcement signal lacks theoretical justification – lower entropy does not guarantee higher reward and can indicate premature convergence. Formally define every component: how demonstrations are collected, when imitation is triggered (entropy threshold), and how the imitation experience is incorporated into the Q‑update. Either justify the entropy signal or replace it with a reward‑based signal (e.g., advantage).
Introduction must be refined. The current introduction does not clearly articulate the motivation (why social learning is needed in multi‑agent RL beyond existing LfD methods) and the novelty (what exactly your framework adds that is not already covered by GAIL, DAGGER, or behavioral cloning). Rewrite the introduction to highlight the gap in existing MARL, the specific limitations of prior work, and how your entropy‑driven social learning uniquely addresses them.
System block diagram is inadequate. Figure 3 fails to show the data flow, including how entropy is calculated, how similarity matching is performed, and how the enactment mechanism interacts with the Q‑learning loop. Redesign the diagram to clearly illustrate all information pathways – observation → policy → entropy monitoring → demonstration retrieval → imitation advantage → experience replay → policy update.
Statistical and reproducibility concerns. You report 120 runs but do not provide mean ± standard deviation tables for final performance. Sensitivity analyses for hyperparameters (learning rate, discount factor, similarity threshold τ) are missing. Provide full statistical tables and sensitivity results. Ensure pseudocode is readable and complete.

Given the severity of these issues, I recommend rejection in the current form. A major revision addressing all points above – especially the multi‑agent experiments, corrected methodology, and proper presentation – would be necessary for any resubmission.

Sincerely,

Reviewer

Author Response

Comment 1: Misalignment between claim and experiments. The title and abstract repeatedly refer to “Multi‑Agent Systems”, yet your experiments are predominantly single‑agent (CartPole) or involve at most two non‑interacting agents. The foraging task uses only one learning agent at a time; the other “agents” are pre‑defined trajectories or pre‑trained demonstrators without co‑learning or genuine interaction. This invalidates your core claim. You must either conduct proper multi‑agent experiments on standard benchmarks (e.g., SMAC, MPE, Overcooked with 3+ simultaneously learning agents) or revise the title and claims accordingly.

Response 1: We thank the reviewer for this important observation. We agree that the original manuscript overstated the connection between the proposed framework and broader multi-agent reinforcement learning settings. The current study evaluates social learning through behavioral observation rather than simultaneous decentralized co-learning among multiple adaptive agents.

To address this concern, we revised the manuscript to consistently align the scope of the claims with the experimental validation. Specifically, we removed terminology suggesting validation in multi-agent systems and revised the title, keywords, and relevant sections of the manuscript to emphasize observational social learning and behavior reuse through demonstration rather than concurrent multi-agent learning.

We further clarified in the Conclusion section that the present results demonstrate the effectiveness of SLDRL under observational social-learning scenarios only. Extensions toward larger-scale multi-agent environments, decentralized interaction settings, and simultaneous co-learning remain directions for future work and are not claimed as validated in the current study. These modifications can be found in the title, in the keywords, and on page 28, paragraph 1, line 1048.

Comment 2: Overly simplistic environments. The 10×10 grid and CartPole are low‑dimensional, deterministic, fully observable toy problems. They lack partial observability, stochasticity, high‑dimensional observations, or multiple interacting agents. Such environments provide weak evidence for any claimed robustness or scalability. Replace them with challenging multi‑agent benchmarks that include these realistic complexities.

Response 2: We thank the reviewer for this valuable observation. We agree that the environments used in the present study (grid-based foraging and CartPole) represent controlled benchmark settings and do not capture important complexities such as partial observability, stochastic transitions, high-dimensional sensory inputs, or large-scale multi-agent interaction. However, the objective of the present work was not to demonstrate robustness or scalability across increasingly complex environments, but rather to isolate and evaluate the contribution of the proposed social-learning mechanism under controlled experimental conditions. Accordingly, the selected environments were intentionally retained to minimize confounding factors and allow direct examination of how socially acquired behaviors influence reinforcement learning performance.

To better align the manuscript scope with the presented validation, we revised the limitation and conclusion sections to further clarify that the reported findings should be interpreted as an initial validation of the proposed mechanism rather than evidence of scalability to partially observable, stochastic, high-dimensional, or more complex environments. We additionally emphasized that evaluation in higher-dimensional, visual-input, and more challenging environments remains an important direction for future work. These modifications can be found in the Conclusion section (page 28, paragraph 5, line 1043 and page 29, paragraph 2, line 1062).

Comment 3: Poor technical presentation. Mathematical notation is incorrect (e.g., “Q_i−”, “Q − greedy” – use and -greedy). Equations lack punctuation and have missing parentheses. All figures (1–22) are low resolution, blurry, with illegible text and missing legends. Figure 4 (pseudocode) must be an algorithmic block or table, not an image. Rewrite the entire methodology with formally defined components, correct notation, and high‑quality vector graphics. Provide a proper algorithmic listing.

Response 3: We thank the reviewer for these comments regarding the technical presentation of the manuscript. We carefully reviewed the manuscript and implemented several revisions to improve notation consistency, readability, and methodological presentation.

First, mathematical notation was systematically checked throughout the manuscript and formatting inconsistencies were corrected. In particular, notation related to stored transitions, action-selection expressions, and reinforcement learning variables was standardized and minor typographical issues were corrected where identified. Equation formatting and punctuation surrounding mathematical expressions were also reviewed to improve consistency and readability.

Second, all figures were rechecked to improve visual clarity and readability. Text sizes and figure formatting were revised where appropriate to ensure clearer presentation in the final manuscript.

Regarding Figure 4, we considered the reviewer’s suggestion to convert the pseudocode into a separate algorithm block or table format. After revision, we retained Figure 4 as a pseudocode-based figure representation because this format preserves the hierarchical control flow and nested decision structure of the SLDRL behavioral cycle while remaining consistent with the journal manuscript template. Nevertheless, Figure 4 was revised to improve visual quality and readability.

These revisions improve the clarity and reproducibility of the presentation while preserving the original methodological formulation of the proposed framework. No methodological changes to the SLDRL algorithm itself were required.

Comment 4: Lack of meaningful baselines and missing literature. You compare only pure DRL and behavioural cloning. Recent (2024–2025) MARL methods (MAPPO, QPLEX, MADDPG) and modern imitation learning baselines (GAIL, DAGGER, SQIL) are absent. No ablation study separates the entropy mechanism, imitate action, and enact action. Update the literature review and include these baselines. Add ablation experiments to validate each component.

Response 4: We thank the reviewer for this observation regarding baseline selection, literature coverage, and component validation.

First, regarding multi-agent reinforcement learning baselines (e.g., MAPPO, QPLEX, MADDPG), we revised the manuscript scope to remove broader multi-agent learning claims and repositioned the study as an evaluation of observational social learning rather than decentralized multi-agent reinforcement learning. Accordingly, direct comparison against MARL algorithms was not considered appropriate for the revised scope of the manuscript.

Second, regarding imitation-learning baselines, we expanded the Related Work section to better position SLDRL relative to contemporary imitation-learning approaches, including behavioral cloning (BC), inverse reinforcement learning (IRL), Generative Adversarial Imitation Learning (GAIL), learning from demonstrations (LfD), learning from observation (LfO), and related demonstration-driven methods. We retained behavioral cloning (BC) as the experimental imitation baseline because it enables direct comparison against non-selective reuse of demonstrated behaviors while preserving comparability with the proposed selective social-learning mechanism. Additional modern imitation-learning approaches were discussed conceptually, while acknowledging differences in assumptions regarding reward inference, expert supervision, and behavioral transfer.

Third, to address concerns regarding component contribution, we extended the analyses included in the manuscript by introducing an additional component ablation analysis (Appendix B.4). In addition to the previously included sensitivity analyses of SL-specific parameters (including action-sequence length, social-learning memory capacity, and similarity threshold), we evaluated two new component-level ablation settings: (i) a variant with the entropy-based reinforcement mechanism disabled and (ii) a variant with state-similarity matching disabled while preserving the underlying DQN training process.

The results show that removing either mechanism leads to a statistically significant reduction in learning performance relative to the complete SLDRL configuration. The reduction is more pronounced when entropy-based reinforcement is removed, indicating that adaptive behavioral evaluation contributes substantially to the effective reuse of socially acquired behaviors. At the same time, disabling state-similarity matching also produces a statistically significant performance decrease, demonstrating that contextual alignment contributes meaningfully to the selection and enactment of observed behaviors. Across all experiments, the complete SLDRL configuration achieved the highest learning efficiency.

We did not adopt complete removal of imitate or enact operations as ablation settings because these operations represent dependent execution stages of the proposed social-learning cycle rather than independent enhancement mechanisms. Removing imitate prevents behavioral acquisition, while removing enact prevents behavioral reuse, effectively eliminating social learning itself rather than isolating individual component contributions. Therefore, we focused the ablation analysis on internal decision mechanisms that regulate how socially acquired behaviors are evaluated and reused. These modifications can be found at page 4, paragraph 2, line 201, page 10, paragraph 1, line 403 and Appendix B.4.

Comment 5: Unclear definitions and insufficient validation. The “enactment mechanism” is not formally defined (is it exact replay or added experience?). The entropy‑based reinforcement signal lacks theoretical justification – lower entropy does not guarantee higher reward and can indicate premature convergence. Formally define every component: how demonstrations are collected, when imitation is triggered (entropy threshold), and how the imitation experience is incorporated into the Q‑update. Either justify the entropy signal or replace it with a reward‑based signal (e.g., advantage).

Response 5: We thank the reviewer for this comment. However, we respectfully note that several of the requested implementation details and component definitions were already formally described in the manuscript and have been retained in the revised version.

Specifically, the enactment mechanism is defined in the methodology section describing social learning integration, where socially acquired behavioral sequences are retrieved from the social-learning memory and executed step-by-step in the environment. During enactment, generated transitions are stored in the replay buffer and incorporated into the standard DQN learning procedure in the same manner as ordinary environment interactions (Section 2.3, page 9, lines 312–340).

Similarly, the manuscript already specifies that imitation and enactment are not triggered through an entropy threshold. Instead, these operations are modeled as additional actions available to the agent and selected through the same action-selection mechanism used for primitive actions (Section 3, pages 7–9).
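The two points above (meta-actions chosen by the ordinary action-selection rule, and enacted sequences feeding the replay buffer like any other interaction) can be illustrated with a short sketch. It is not the manuscript's pseudocode: the epsilon-greedy rule, the toy corridor environment `toy_step`, the action names, and the tuple layout of stored transitions are all assumptions made for illustration.

```python
import random
from collections import deque

# Hypothetical action set: two primitive moves plus the two social meta-actions.
PRIMITIVE_ACTIONS = ["left", "right"]
META_ACTIONS = ["imitate", "enact"]
ALL_ACTIONS = PRIMITIVE_ACTIONS + META_ACTIONS


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over the extended action set (index into ALL_ACTIONS)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def toy_step(state, action, goal=3):
    """Toy 1-D corridor used only for this sketch: reward 1.0 on reaching `goal`."""
    next_state = state + (1 if action == "right" else -1)
    done = next_state == goal
    return next_state, (1.0 if done else 0.0), done


def enact(step_fn, state, sequence, replay_buffer):
    """Execute a stored primitive-action sequence step by step and store each
    resulting transition exactly as an ordinary environment interaction would be."""
    for action in sequence:
        next_state, reward, done = step_fn(state, action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return state


replay_buffer = deque(maxlen=1000)
final_state = enact(toy_step, 0, ["right", "right", "right", "right"], replay_buffer)
```

Because the enacted transitions enter the same buffer, a standard DQN update that samples from `replay_buffer` needs no special handling for socially acquired experience in this sketch.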

Regarding demonstration collection and storage, the manuscript describes that observed behavioral sequences are collected during imitation episodes, stored in the social-learning memory together with associated state information, and later reused through the enactment process (Section 3, pages 7–9).

Regarding the entropy-based reinforcement signal, we agree that reduced entropy does not theoretically guarantee higher external reward. However, the proposed entropy formulation was not introduced as a reward surrogate. As described in the manuscript, entropy is used as an intrinsic behavioral evaluation mechanism intended to estimate action certainty and regulate selective reuse of socially acquired behaviors rather than directly optimize external return (Section 3, Equation (2), page 8).

To further strengthen empirical support for this design choice, we expanded the ablation analysis and added an additional experimental setting in which the entropy-based behavioral reinforcement mechanism was disabled. This allows direct evaluation of the contribution of the entropy mechanism within SLDRL without changing the intended design assumptions of the proposed framework.

For these reasons, we did not replace the entropy signal with reward-based or advantage-based alternatives, as such a modification would fundamentally alter the objective and behavioral selection mechanism of SLDRL.

Comment 6: Introduction must be refined. The current introduction does not clearly articulate the motivation (why social learning is needed in multi‑agent RL beyond existing LfD methods) and the novelty (what exactly your framework adds that is not already covered by GAIL, DAGGER, or behavioral cloning). Rewrite the introduction to highlight the gap in existing MARL, the specific limitations of prior work, and how your entropy‑driven social learning uniquely addresses them.

Response 6: We thank the reviewer for this observation regarding the motivation and positioning of the proposed framework relative to existing demonstration-based learning approaches.

Following the revised scope of the manuscript, the present study is no longer positioned as a contribution to multi-agent reinforcement learning. Therefore, the Introduction was not restructured around limitations of multi-agent reinforcement learning methods. Instead, we revised the Introduction and Related Work sections to more clearly articulate the intended motivation and novelty of the proposed framework within the context of observational social learning and imitation-based reinforcement learning.

We expanded the discussion comparing SLDRL with existing demonstration-driven approaches and clarified the distinctions between behavioral cloning (BC), imitation learning, and adversarial imitation approaches. The revised text now explicitly states that BC directly reproduces demonstrated state–action mappings, GAIL attempts to learn expert-like policies through adversarial optimization, and DAGGER relies on iterative aggregation of expert supervision. In contrast, SLDRL treats socially observed behaviors as reusable behavioral candidates whose usefulness is evaluated through subsequent interaction with the environment.

We additionally clarified that socially acquired behaviors are not directly enforced but are selectively observed and enacted under the learner’s own decision process while maintaining an adaptive balance between individual and social learning. The revised text further emphasizes that the proposed entropy-based mechanism is used as an intrinsic behavioral evaluation process rather than as a replacement for reward optimization. These modifications can be found at page 3, paragraph 1, line 123 and page 5, paragraphs 2–3, line 201.

Comment 7: System block diagram is inadequate. Figure 3 fails to show the data flow, including how entropy is calculated, how similarity matching is performed, and how the enactment mechanism interacts with the Q‑learning loop. Redesign the diagram to clearly illustrate all information pathways – observation → policy → entropy monitoring → demonstration retrieval → imitation advantage → experience replay → policy update.

Response 7: We thank the reviewer for this suggestion regarding Figure 3.

We respectfully note that Figure 3 was intended as a high-level architectural overview illustrating the interaction between the Deep Reinforcement Learning (DRL) layer and the Social Learning (SL) layer, rather than as a complete block diagram of all algorithmic operations. The detailed procedural flow of the proposed framework, including imitation, enactment, behavioral storage, entropy-based evaluation, transition storage, and DQN update steps, is already provided separately in the pseudocode representation in Figure 4 and in the accompanying methodological description.

To avoid unnecessary duplication between the architecture figure and the pseudocode, we did not redesign Figure 3 as a full algorithmic block diagram. However, we revised Figure 3 to improve clarity and make the interaction between the two layers more explicit. In particular, the updated figure now indicates the main functions of the DRL layer and the SL layer, including action selection and Q-learning in the DRL layer, and behavior memory, similarity matching, and entropy evaluation in the SL layer. The interaction arrows were also clarified to show imitate, enact/retrieve behavior, demonstrated behavior input, current state input, and other primitive actions.

Comment 8: Statistical and reproducibility concerns. You report 120 runs but do not provide mean ± standard deviation tables for final performance. Sensitivity analyses for hyperparameters (learning rate, discount factor, similarity threshold τ) are missing. Provide full statistical tables and sensitivity results. Ensure pseudocode is readable and complete.

Response 8: We thank the reviewer for this comment regarding statistical reporting and reproducibility.

We respectfully note that the manuscript already reports aggregated performance across repeated independent runs using mean learning curves with 95% confidence intervals, together with statistical significance analyses. Since the main objective of the study is to evaluate learning dynamics and sample efficiency rather than only final converged performance, final-performance-only tables were not adopted as the primary reporting format.
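As an illustration of the reporting format described above, a mean learning curve with a 95% confidence band across independent runs can be computed as sketched below. The normal-approximation multiplier 1.96 and the name `mean_ci95` are assumptions of this sketch; the manuscript's exact interval construction is not stated in this response.

```python
import math
import statistics


def mean_ci95(runs):
    """Given a list of per-run reward curves of equal length, return one
    (mean, lower, upper) tuple per episode using a normal-approximation 95% CI."""
    n = len(runs)
    if n < 2:
        raise ValueError("at least two runs are needed to estimate a standard error")
    summary = []
    for values in zip(*runs):  # iterate episode by episode across runs
        mean = statistics.fmean(values)
        std_err = statistics.stdev(values) / math.sqrt(n)
        half_width = 1.96 * std_err
        summary.append((mean, mean - half_width, mean + half_width))
    return summary


# Two tiny illustrative runs of two episodes each; not data from the manuscript.
curve = mean_ci95([[1.0, 2.0], [3.0, 2.0]])
```

With around 100 runs the normal approximation is usually close to the Student-t interval; for a handful of runs a t-based multiplier would be the safer choice.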

Regarding sensitivity analysis, we note that the manuscript already included sensitivity analyses for SLDRL-specific parameters, including action-sequence length, social-learning memory capacity, and the state-similarity threshold τ. These analyses were selected because they directly correspond to the proposed social-learning mechanism.
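To make the roles of memory capacity and the similarity threshold τ concrete, a bounded behavior memory with threshold-based retrieval might look like the sketch below. The similarity measure used in the manuscript is not given in this response, so the form 1 / (1 + Euclidean distance) is purely an assumption, as are the class name `SocialMemory`, its methods, and the first-in-first-out eviction rule.

```python
import math
from collections import deque


class SocialMemory:
    """Bounded store of (state, action sequence) pairs with similarity-gated retrieval."""

    def __init__(self, capacity, tau):
        # A deque with maxlen drops the oldest entry once capacity is exceeded.
        self.entries = deque(maxlen=capacity)
        self.tau = tau

    def store(self, state, sequence):
        self.entries.append((tuple(state), list(sequence)))

    def retrieve(self, state):
        """Return the sequence whose stored state is most similar to `state`,
        or None if no entry reaches the threshold tau."""
        best_sequence, best_similarity = None, self.tau
        for stored_state, sequence in self.entries:
            # Assumed similarity: 1 for identical states, approaching 0 with distance.
            similarity = 1.0 / (1.0 + math.dist(stored_state, state))
            if similarity >= best_similarity:
                best_sequence, best_similarity = sequence, similarity
        return best_sequence


memory = SocialMemory(capacity=2, tau=0.5)
memory.store((0, 0), ["up"])
memory.store((3, 4), ["left"])
near = memory.retrieve((0, 1))    # distance 1 from (0, 0), similarity 0.5
far = memory.retrieve((10, 10))   # no stored state is close enough
```

In this sketch a larger τ makes retrieval stricter, and a smaller capacity forgets older observations sooner, which are the two effects a sensitivity analysis over these parameters would probe.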

In response to the reviewer’s broader concern about component-level validation, we further expanded the experimental analysis by adding ablation settings in which the entropy-based behavioral reinforcement mechanism and the state-similarity-based behavioral selection mechanism were disabled. These additional experiments complement the existing sensitivity analyses and provide a more direct evaluation of the contribution of the core SLDRL components.

Regarding learning rate and discount factor sensitivity, we did not perform full sweeps over these general DQN hyperparameters because they belong to the underlying DQN implementation rather than the proposed social-learning layer. Instead, we focused the sensitivity and ablation analyses on parameters and mechanisms directly associated with SLDRL, while reporting the main DQN hyperparameters explicitly for reproducibility.

Finally, the pseudocode representation was revised to improve readability and completeness, including clearer notation for imitate, enact, transition storage, entropy-based behavioral update, and the standard DQN update process. New additions to ablation studies can be found at Appendix B.4 and Appendix B.5.

Reviewer 4 Report

Comments and Suggestions for Authors

This manuscript presents a novel Social Learning-Enhanced Deep Reinforcement Learning (SLDRL) framework that integrates social learning mechanisms into DRL to improve agent performance in both discrete and continuous environments. The hybrid architecture enables adaptive selection of observed behaviors via an entropy-based intrinsic motivation mechanism, with experiments in foraging tasks and CartPole showing faster learning and higher rewards compared to pure DRL and behavioral cloning baselines. The work contributes to multi-agent reinforcement learning by demonstrating robust social learning without requiring expert demonstrators.

The entropy-based intrinsic motivation is central to SLDRL, but the manuscript does not explicitly justify why action-based entropy (Equation 2) is an appropriate metric for evaluating behavioral utility. Please elaborate on how entropy reduction correlates with improved task performance, ideally with empirical evidence from control experiments.
The current experiments focus on single-learner settings with either predefined trajectories or pairwise observation. How does SLDRL scale to larger multi-agent systems with varying expertise levels or competitive dynamics? Addressing this would strengthen claims of real-world applicability.
The hybrid architecture combines imitation/enact meta-actions, entropy-based filtering, and state-similarity matching. Which component drives the performance gains? Ablation studies isolating these elements are necessary to validate their individual contributions.
The manuscript mentions dynamic balance between social and individual learning, but there is no theoretical analysis of convergence or stability. Does SLDRL guarantee policy improvement under certain conditions? A brief discussion of theoretical properties would enhance rigor.

Author Response

Comment 1: The entropy-based intrinsic motivation is central to SLDRL, but the manuscript does not explicitly justify why action-based entropy (Equation 2) is an appropriate metric for evaluating behavioral utility. Please elaborate on how entropy reduction correlates with improved task performance, ideally with empirical evidence from control experiments.

Response 1: We thank the reviewer for this important observation regarding the motivation and validation of the entropy-based intrinsic evaluation mechanism.

We agree that the rationale behind using action-based entropy as an internal behavioral evaluation signal required further clarification. To address this concern, we expanded the explanation in Section 4 and clarified the intended role of entropy in the proposed framework.

In SLDRL, action-based entropy is not used as a direct proxy for external task reward or final task success. Instead, it is used as an internal indicator of policy uncertainty. The underlying intuition is that, if enacting an observed behavior leads to more differentiated Q-value estimates among available actions in the visited states, the agent’s action preference becomes less random and more structured. This manifests as a reduction in action-based entropy and indicates that the socially acquired behavior contributes useful information to the current learning process. Accordingly, the entropy-based mechanism does not directly reinforce behaviors because they produce higher rewards; rather, it increases the future reuse probability of behaviors that reduce uncertainty in action selection. This design was selected to preserve compatibility with sparse-reward environments, where immediate reward signals may be unavailable or delayed.
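The intuition that more differentiated Q-values correspond to lower action entropy can be checked numerically. Equation (2) of the manuscript is not reproduced in this response, so the sketch below assumes a softmax (Boltzmann) distribution over Q-values followed by Shannon entropy; the function name `action_entropy` and the `temperature` parameter are hypothetical.

```python
import math


def action_entropy(q_values, temperature=1.0):
    """Shannon entropy (in nats) of a softmax distribution over Q-values."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


flat = action_entropy([0.0, 0.0, 0.0, 0.0])    # no preference among actions
peaked = action_entropy([0.0, 0.0, 0.0, 5.0])  # one action clearly preferred
```

Under this assumption, identical Q-values give the maximum entropy log(4) for four actions, and a clearly preferred action drives the entropy toward zero, which matches the "less random and more structured" action preference described above.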

To further validate the contribution of this mechanism, we additionally introduced a component ablation analysis (Appendix B.4) in which the entropy-based reinforcement mechanism was removed while preserving the remaining SLDRL architecture and DQN learning process. The results show that disabling entropy-based reinforcement leads to a statistically significant reduction in learning performance relative to the complete SLDRL configuration. This result provides empirical evidence that entropy-guided behavioral evaluation contributes positively to learning efficiency and supports the use of entropy reduction as an intrinsic criterion for selecting socially acquired behaviors. These modifications can be found at page 10, paragraph 1, line 403 and Appendix B.4.

Comment 2: The current experiments focus on single-learner settings with either predefined trajectories or pairwise observation. How does SLDRL scale to larger multi-agent systems with varying expertise levels or competitive dynamics? Addressing this would strengthen claims of real-world applicability.

Response 2: We thank the reviewer for this valuable observation. We agree that evaluating social learning under larger multi-agent settings with heterogeneous expertise levels and competitive interactions represents an important future research direction. In the revised manuscript, however, we intentionally narrowed the scope of the work to avoid overstating the applicability of the proposed method. Accordingly, we revised the title, abstract, introduction, and discussion to position SLDRL as a social learning enhancement mechanism for reinforcement learning through behavioral observation, rather than as a validated multi-agent reinforcement learning framework.

The current study focuses on controlled observational learning settings involving a single learning agent that selectively acquires and reuses behaviors demonstrated by others. This design was intentionally chosen to isolate and evaluate the contribution of the proposed social learning mechanism before introducing additional complexity arising from large-scale multi-agent coordination, heterogeneous populations, or competitive dynamics. To reflect this limitation more explicitly, we added clarification in the revised manuscript stating that scalability to larger populations, varying expertise distributions, cooperative–competitive settings, and decentralized multi-agent environments remains outside the scope of the present study and is identified as an important direction for future work. These modifications can be found in the title, in the keywords, and on page 29, paragraph 3, line 1075.

Comment 3: The hybrid architecture combines imitation/enact meta-actions, entropy-based filtering, and state-similarity matching. Which component drives the performance gains? Ablation studies isolating these elements are necessary to validate their individual contributions.

Response 3: We thank the reviewer for this valuable suggestion. We agree that understanding the contribution of the main components of SLDRL is important for interpreting the observed performance gains.

In the revised manuscript, we expanded the analyses presented in Appendix B by introducing an additional component ablation analysis (Appendix B.4) to evaluate the contribution of individual SLDRL mechanisms. Specifically, we evaluated two ablated variants of the proposed framework: (i) a variant with the entropy-based reinforcement mechanism disabled and (ii) a variant with state-similarity matching disabled while preserving the same DQN architecture, training procedure, and experimental setup used in the main experiments.

We note that imitate and enact were not evaluated as fully independent modules because they constitute structurally interdependent stages of the proposed social-learning cycle. The imitate action is responsible for acquiring behavioral demonstrations, whereas enact provides the mechanism through which stored behaviors influence subsequent learning. Disabling either operation would effectively eliminate the social-learning process itself rather than isolate the contribution of an internal decision mechanism. Therefore, the ablation analysis focused on mechanisms that can be selectively removed while preserving the overall SLDRL architecture and allowing meaningful comparison.

The additional results show that removing either entropy-based reinforcement or state-similarity matching leads to a statistically significant reduction in learning performance relative to the complete SLDRL configuration. The reduction is more pronounced when entropy-based reinforcement is removed, indicating that adaptive behavioral evaluation plays a stronger role in determining which socially acquired behaviors are reused. At the same time, disabling state-similarity matching also produces a statistically significant performance decrease, demonstrating that contextual alignment contributes meaningfully to the effective retrieval and enactment of observed behaviors.

Importantly, because SLDRL is designed as a social-learning enhancement layer operating on top of DQN rather than a standalone replacement for reinforcement learning, removing individual social-learning mechanisms is not expected to prevent learning entirely. Instead, the purpose of the ablation analysis is to quantify the contribution of each mechanism to learning efficiency. Across all experiments, the complete SLDRL configuration consistently achieved the highest performance. These modifications can be found on page 10, paragraph 1, line 403, and in Appendix B.4.

Comment 4: The manuscript mentions dynamic balance between social and individual learning, but there is no theoretical analysis of convergence or stability. Does SLDRL guarantee policy improvement under certain conditions? A brief discussion of theoretical properties would enhance rigor.

Response 4: We thank the reviewer for this important observation. We agree that clarifying the theoretical interpretation of the proposed framework improves the rigor and positioning of the study.

In the revised manuscript, we expanded the discussion section to explicitly clarify the theoretical scope and limitations of SLDRL. We emphasize that the proposed framework does not claim formal convergence or policy-improvement guarantees. Since SLDRL is implemented on top of DQN with nonlinear function approximation, formal convergence guarantees are generally not available even for the underlying learning algorithm.

To clarify the role of the proposed social learning mechanism, we added discussion explaining that SLDRL operates as a modular enhancement layer and does not introduce a separate optimization objective or modify the underlying Bellman update procedure. Instead, socially acquired behaviors influence learning indirectly by altering the distribution of experiences encountered during training through selective observation and enactment. Accordingly, the reported performance gains should be interpreted as empirical evidence under the evaluated experimental conditions rather than formal guarantees of convergence or stability. We additionally identified formal theoretical analysis of the interaction between social learning and replay-based deep reinforcement learning as an important direction for future work. These modifications can be found on page 27, paragraph 5, line 997.
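For reference, the standard DQN objective can be written as shown below. This is the textbook form, assuming a target network with parameters \(\theta^{-}\) and a replay buffer \(\mathcal{D}\); the manuscript's own notation may differ. Under the interpretation given in this response, social learning leaves the target \(y_t\) and the loss \(L(\theta)\) unchanged and affects only which transitions enter \(\mathcal{D}\).

```latex
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}), \qquad
L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}
  \left[ \left( y_t - Q(s_t, a_t; \theta) \right)^{2} \right]
```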

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for your efforts in revising the manuscript. However, after a thorough review, several significant issues remain that prevent its acceptance. My detailed concerns are outlined below.

The most critical problem lies in the formalization of your methodology, which is not yet at a publishable standard. Several core components are vaguely defined or lack proper justification. For instance, the choice of "5 time steps" for the agent to remain stationary during observation appears arbitrary, and the evidence in Appendix Figure 20 does not convincingly show a significant difference from lengths of 2 or 10. A robust justification is needed for this key design choice. Furthermore, Equation 2 is incomplete; the calculation of b_decrease must be explicitly defined rather than just described in prose. There is also a fundamental disconnect in notation: the concept of "specific behavior b" lacks a formal definition, and the variable ProbSelect_b in the text is not strictly equivalent to ProbSelect(b) in the pseudocode, causing confusion. The pseudocode itself (Figure 4) is not acceptable. It should be a complete, readable algorithmic listing, not an image, and must clearly show how its logical branches, such as the IF statement in line 6, are triggered.

The presentation of your framework’s overview is also inadequate. The system block diagram (Figures 2 and 3) must be integrated into a single, professional figure that clearly delineates all information pathways—from observation and entropy calculation to behavior retrieval and policy update. The current version fails to help readers form a clear mental model of the SLDRL framework. Additionally, the Introduction still falls short. It must be refined to explicitly and concisely list the paper's contributions in a way that highlights its novelty against existing work.

Finally, the experimental validation remains weak. Using only one standard control environment (CartPole) is insufficient to support your claims. The experimental foundation must be significantly broadened with additional, more challenging environments to demonstrate the generalizability and robustness of your approach. Several minor formatting issues, such as unnecessary indentation on line 313 and missing punctuation after all equations, should also be corrected throughout the manuscript.

These issues collectively represent a fundamental lack of clarity and rigor. A major restructuring of the methodology and a substantial extension of the experimental validation are mandatory before the manuscript can be considered for publication.

Sincerely,

Author Response

Comment 1: The most critical problem lies in the formalization of your methodology, which is not yet at a publishable standard. Several core components are vaguely defined or lack proper justification. For instance, the choice of "5 time steps" for the agent to remain stationary during observation appears arbitrary, and the evidence in Appendix Figure 20 does not convincingly show a significant difference from lengths of 2 or 10. A robust justification is needed for this key design choice. Furthermore, Equation 2 is incomplete; the calculation of b_decrease must be explicitly defined rather than just described in prose. There is also a fundamental disconnect in notation: the concept of "specific behavior b" lacks a formal definition, and the variable ProbSelect_b in the text is not strictly equivalent to ProbSelect(b) in the pseudocode, causing confusion. The pseudocode itself (Figure 4) is not acceptable. It should be a complete, readable algorithmic listing, not an image, and must clearly show how its logical branches, such as the IF statement in line 6, are triggered.

Response 1: We thank the reviewer for this detailed comment. We agree that the formal presentation of the methodology is important, and we have revised the manuscript to further improve clarity, notation consistency, and algorithmic readability.

First, regarding the 5-step observation window, we respectfully clarify that this value was not selected arbitrarily. A dedicated sensitivity analysis was already provided in Appendix B.1 using action-sequence lengths of 2, 5, and 10 under identical experimental conditions. The purpose of this analysis was not to establish a universal theoretical optimum, but to evaluate whether the proposed SLDRL framework is sensitive to this design parameter. The results show that all tested sequence lengths produce similar qualitative learning dynamics and converge to comparable final performance levels, indicating that the framework is reasonably robust within the tested range. Although the observed performance differences are relatively modest, the 5-step configuration provides the most favorable overall performance among the tested alternatives and shows slightly faster learning during several stages of training. Therefore, it was selected as the default configuration used in the main experiments as a practical compromise between behavioral informativeness and contextual robustness. To avoid overstating this result, we revised the corresponding discussion in Appendix B.1 accordingly.

Second, we revised the entropy-based behavior evaluation mechanism to define \(b_{decrease}\) more explicitly. Although the previous manuscript described this mechanism in prose, we now provide a numbered equation defining when \(b_{decrease}\) is incremented. The revised text also explains \(Entropy_{before}\), \(Entropy_{after}\), and the visited-state set \(S_b\), thereby making the update rule fully explicit.

Third, we clarified the meaning of a specific behavior \(b\). In the revised manuscript, \(b\) is explicitly defined as a stored observed action sequence together with the corresponding observation start state. This clarification was added before the entropy-based behavior evaluation equations, where \(b\), \(b_{decrease}\), \(b_{enacted}\), and \(ProbSelect_b\) are used.

Fourth, we addressed the notation issue concerning \(ProbSelect_b\). The notation in the algorithmic listing has been revised to be fully consistent with the formal equation. The manuscript now consistently uses \(ProbSelect_b\), and no \(ProbSelect(b)\) notation remains in the algorithmic listing or related text.

Finally, in response to the reviewer’s concern about the pseudocode presentation, the previous Figure 4 has been converted into a text-based algorithmic listing, now presented as Algorithm 1. The algorithm explicitly shows the action-selection step and all logical branches. In particular, the triggering mechanism is shown in line 5, where \(a_t\) is selected using the ε-greedy DQN policy. The imitate branch is then triggered in line 6 if \(a_t = \text{imitate}\), the enact branch is triggered in line 13 if \(a_t = \text{enact}\), and the primitive-action branch is handled in line 24 through the else condition. Thus, the revised Algorithm 1 explicitly presents the complete branching structure of the SLDRL behavioral cycle.

These modifications can be found on page 9, paragraph 1, line 351; page 9, paragraph 4, line 390; page 11, paragraph 1, line 448; and page 31, paragraph 3, line 1156.
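The branching structure described for Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the manuscript's listing: the primitive action names, the dictionary-based Q-value lookup, and the function names `select_action` and `behavioral_step` are hypothetical, and the bodies of the imitate and enact branches are reduced to labels.

```python
import random

PRIMITIVE_ACTIONS = ["up", "down", "left", "right"]  # hypothetical action set
META_ACTIONS = ["imitate", "enact"]                  # meta-actions named in the response
ALL_ACTIONS = PRIMITIVE_ACTIONS + META_ACTIONS


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy selection over primitive actions and meta-actions."""
    if rng.random() < epsilon:
        return rng.choice(ALL_ACTIONS)
    return max(ALL_ACTIONS, key=lambda a: q_values[a])


def behavioral_step(q_values, epsilon, rng=random):
    """Select a_t and report which branch of the behavioral cycle it triggers."""
    a_t = select_action(q_values, epsilon, rng)
    if a_t == "imitate":
        branch = "imitate"    # observe a demonstrator and store the behavior
    elif a_t == "enact":
        branch = "enact"      # retrieve and replay a stored behavior
    else:
        branch = "primitive"  # execute an ordinary environment action
    return a_t, branch
```

The point of the sketch is that both meta-actions sit in the same action set as the primitive actions, so a single ε-greedy selection step decides which branch runs.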

Comment 2: The presentation of your framework’s overview is also inadequate. The system block diagram (Figures 2 and 3) must be integrated into a single, professional figure that clearly delineates all information pathways—from observation and entropy calculation to behavior retrieval and policy update. The current version fails to help readers form a clear mental model of the SLDRL framework. Additionally, the Introduction still falls short. It must be refined to explicitly and concisely list the paper's contributions in a way that highlights its novelty against existing work.

Response 2: We thank the reviewer for these suggestions regarding the presentation and positioning of the proposed framework.

Regarding the system overview, we respectfully chose to preserve Figures 2 and 3 as separate figures because they serve different explanatory purposes and operate at different abstraction levels. Figure 2 presents the internal architecture and optimization mechanism of the underlying DQN component, whereas Figure 3 focuses specifically on the proposed SLDRL extension and illustrates the interaction between the DRL and Social Learning layers, including behavior storage, retrieval, similarity matching, entropy-based behavior evaluation, and behavior selection. Combining both figures into a single diagram would substantially increase visual complexity by mixing low-level DQN optimization details with the higher-level social learning workflow. Nevertheless, to improve clarity, we revised the caption and description of Figure 3 to more explicitly emphasize the information flow and interaction between observation, behavior evaluation, behavior selection, and policy update processes, while clarifying that the internal DQN optimization process is separately detailed in Figure 2.

Regarding the Introduction, we respectfully note that the manuscript already contains an extended discussion of the novelty and contributions of the proposed method in the final paragraphs of the Introduction section. In particular, the revised Introduction explicitly describes: (i) the restriction of knowledge transfer to externally observable behaviors rather than internal parameters or reward signals, (ii) the preservation of independent reinforcement learning updates without access to demonstrator internals, (iii) the autonomous selection of when and how socially acquired behaviors are utilized, and (iv) the entropy-based intrinsic evaluation mechanism used to regulate future behavior reuse. To make these contributions more immediately visible to readers, we further refined the corresponding discussion and added a concise summary sentence emphasizing how these design choices distinguish SLDRL from conventional demonstration-based and imitation-based reinforcement learning approaches, while avoiding unnecessary repetition of material already discussed in detail in the Introduction. These modifications can be found on page 8 (caption of Figure 3) and on page 3, paragraph 1, line 122.

Comment 3: Finally, the experimental validation remains weak. Using only one standard control environment (CartPole) is insufficient to support your claims. The experimental foundation must be significantly broadened with additional, more challenging environments to demonstrate the generalizability and robustness of your approach. Several minor formatting issues, such as unnecessary indentation on line 313 and missing punctuation after all equations, should also be corrected throughout the manuscript.

Response 3: We thank the reviewer for the comments regarding experimental validation and manuscript presentation.

Regarding experimental diversity, we respectfully clarify that the current manuscript evaluates SLDRL in multiple reinforcement learning settings rather than a single standard control environment. Specifically, the experimental evaluation includes: (i) a sparse-reward discrete-state grid-based foraging environment designed to investigate socially guided exploration under delayed feedback conditions, and (ii) a dense-reward continuous-state/discrete-action CartPole environment used to examine the applicability of the proposed mechanism under a different state representation and reward structure. In addition, the study includes multiple experimental scenarios and analyses, including predefined behavioral demonstrations, expert observation, observation of experienced and inexperienced agents, behavioral cloning baselines, component ablation experiments, robustness analysis under noisy imitation, and sensitivity analysis of key social learning parameters.

That said, we agree that evaluating the proposed framework in additional environments could provide further evidence regarding broader applicability. However, the objective of the present study is not exhaustive benchmarking across a large collection of reinforcement learning environments, but rather the introduction and initial validation of the proposed social learning mechanism under controlled conditions with substantially different environmental characteristics. To avoid overstating the scope of the conclusions, the manuscript already explicitly discusses this limitation and identifies broader validation in larger-scale and more complex environments as an important direction for future research.

Finally, following the reviewer’s suggestion, we carefully reviewed the manuscript formatting and corrected minor presentation inconsistencies throughout the document, including paragraph alignment and equation formatting where necessary.

Comment 4: These issues collectively represent a fundamental lack of clarity and rigor. A major restructuring of the methodology and a substantial extension of the experimental validation are mandatory before the manuscript can be considered for publication.

Response 4: We thank the reviewer for the overall assessment and have carefully considered the concerns regarding methodological clarity and experimental scope.

Following the reviewer’s comments, we revised the manuscript extensively to improve clarity, formal consistency, readability, and presentation throughout the methodology section. In particular, all concrete methodology-related concerns raised in the review were addressed in the revised manuscript. These revisions include additional justification of design choices, explicit formalization of previously text-described update mechanisms, clarification of notation and behavioral definitions, improved algorithm presentation, refinement of framework visualization, and revision of the associated explanatory text.

We note that these revisions primarily concern methodological presentation and formal clarity rather than changes to the underlying SLDRL mechanism itself. After implementing the requested revisions, we respectfully do not believe that a major restructuring of the methodology is necessary for the present study, as the core algorithmic design and experimental protocol remain unchanged and the reviewer’s actionable concerns have been addressed directly.

Regarding experimental validation, the revised manuscript already includes evaluation in both discrete-state sparse-reward and continuous-state environments, together with additional analyses including behavioral cloning comparisons, component ablation, sensitivity analysis, and robustness experiments under noisy imitation. We therefore believe that the revised manuscript provides an appropriate initial validation of the proposed framework while acknowledging that broader validation across larger-scale and more complex environments remains an important direction for future work.

We hope these revisions sufficiently address the reviewer’s concerns and improve the overall clarity, readability, and presentation of the manuscript.

Reviewer 4 Report

Comments and Suggestions for Authors

The revision is good. I have no further comments.

Author Response

Comment 1: The revision is good. I have no further comments.

Response 1: We sincerely thank the reviewer for carefully evaluating the revised manuscript and for the positive assessment. We appreciate the reviewer’s recognition of the revisions and improvements made throughout the review process. We are pleased that the revised version addresses the previous concerns, and we thank the reviewer for the constructive feedback that helped improve the clarity and overall quality of the manuscript.

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for your detailed response. While some clarifications have been made, the manuscript still suffers from fundamental problems in presentation, formal rigour, and experimental validation. My remaining concerns are outlined below.

The system overview remains inadequate. Your decision to keep Figures 2 and 3 separate does not resolve the original complaint; the presentation is still fragmented, and neither figure provides a clear, end-to-end mental model of the SLDRL framework. The information pathways from observation to policy update are not coherently illustrated. A single, professionally designed block diagram that integrates all key components is essential for readability.

The formal quality of the methodology is still below the required standard. For instance, Equation 2 is written as “−Σ(𝑎 ∈ 𝐴)”, which contains incorrect summation notation and ambiguously formatted variables. Proper mathematical typesetting (e.g., ) and clearly defined terms are mandatory for a scholarly publication. This lack of rigour pervades the section and must be thoroughly corrected.

On the experimental side, your defence that the selected environments are sufficient does not alleviate my concern about generalisability. You argue that more challenging benchmarks are not suitable for your study’s objective, yet if a method cannot be demonstrated on any standard, moderately complex benchmark from the numerous available in the field, its claimed broad applicability is not credible. The current validation on a simple grid world and CartPole is too narrow. At least one additional, more realistic environment is needed to support your claims.

Given these persistent issues, the manuscript requires a further major revision that fundamentally improves presentation rigour, delivers a readable system diagram, corrects all mathematical notation, and strengthens experimental evidence of generalisation.

Sincerely,

Reviewer

Author Response

Comment 1: The system overview remains inadequate. Your decision to keep Figures 2 and 3 separate does not resolve the original complaint; the presentation is still fragmented, and neither figure provides a clear, end-to-end mental model of the SLDRL framework. The information pathways from observation to policy update are not coherently illustrated. A single, professionally designed block diagram that integrates all key components is essential for readability.

Response 1: Thank you for this valuable suggestion. To improve the presentation of the proposed framework and provide a clearer end-to-end overview of the learning process, we have added a new high-level block diagram (Figure 4) illustrating the overall interaction between the DQN controller, the Social Learning module, and the environment throughout the complete learning cycle.

While Figures 2 and 3 describe the internal architectures of the DQN and Social Learning components separately, the newly added block diagram provides an integrated conceptual view of how these components interact during the complete SLDRL operation. The diagram summarizes the overall information flow from state observation and action selection to action execution, environment feedback, and policy update, thereby providing the end-to-end system overview requested by the reviewer.

To further improve readability, a short explanatory paragraph has been added immediately before Algorithm 1 to introduce the new block diagram and clarify its relationship with the detailed architectural figures. These modifications can be found on page 9, paragraph 1, line 351; page 10, paragraph 2, line 434; and page 11, Figure 4.

Comment 2: The formal quality of the methodology is still below the required standard. For instance, Equation 2 is written as “−Σ(𝑎 ∈ 𝐴)”, which contains incorrect summation notation and ambiguously formatted variables. Proper mathematical typesetting (e.g., ) and clearly defined terms are mandatory for a scholarly publication. This lack of rigour pervades the section and must be thoroughly corrected.

Response 2: Thank you for this comment. To improve mathematical clarity and presentation, we revised Equation (2) using standard mathematical typesetting. In particular, the summation notation has been reformatted using the conventional indexed form \(\sum_{a \in A}\), and the surrounding mathematical expressions have been carefully reviewed to ensure consistent notation and improved readability. In addition, we reviewed the methodology section and refined the formatting of equations and variable definitions where appropriate to improve consistency throughout the manuscript.
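For readers without access to the revised manuscript, an entropy expression with an indexed summation over the action set would typically be typeset as below. This is a generic Shannon-entropy form given only as an illustration: the symbol \(H(s)\) and the probability term \(p(a \mid s)\) are placeholders assumed for this example, and the exact terms of the manuscript's Equation (2) are not reproduced in this response.

```latex
H(s) = -\sum_{a \in A} p(a \mid s) \, \log p(a \mid s)
```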

Comment 3: On the experimental side, your defence that the selected environments are sufficient does not alleviate my concern about generalisability. You argue that more challenging benchmarks are not suitable for your study’s objective, yet if a method cannot be demonstrated on any standard, moderately complex benchmark from the numerous available in the field, its claimed broad applicability is not credible. The current validation on a simple grid world and CartPole is too narrow. At least one additional, more realistic environment is needed to support your claims.

Response 3: Thank you for this thoughtful comment. We agree that demonstrating broad generalization across a large collection of benchmark environments is an important long-term research direction. However, we respectfully believe that such an evaluation represents a substantially different scope from the objective of the present manuscript.

The primary purpose of this work is to introduce the proposed SLDRL framework, describe its implementation in detail, justify its design choices through comprehensive ablation studies and robustness analyses, and demonstrate its applicability in both discrete and continuous reinforcement learning settings. Accordingly, the experimental section includes extensive analyses of the proposed components, including parameter sensitivity, component ablation, robustness under imperfect behavioral demonstrations, and validation in two fundamentally different state-action representations.

A comprehensive evaluation of generalization across numerous benchmark environments would require careful adaptation of the framework to each task, appropriate implementation choices, extensive hyperparameter tuning, and independent experimental analyses for every environment. Although adding a single additional benchmark might partially extend the evaluation, it would not constitute convincing evidence of broad generalization, which would instead require systematic validation across multiple representative environments.

Such an investigation would constitute a substantial study in its own right and would considerably exceed the intended scope of the present manuscript. Moreover, incorporating a sufficiently comprehensive generalization study, together with the associated implementation details and analyses, would substantially increase the length of the manuscript beyond what is appropriate for a single research article.

For this reason, we have deliberately defined the current contribution as the introduction and detailed validation of the proposed framework rather than an exhaustive benchmark study. We also explicitly identify large-scale evaluations on additional and more complex benchmark environments as an important direction for future work in the revised manuscript.

We therefore respectfully maintain the current experimental scope, which we believe is appropriate for the objectives of this study while providing detailed validation of the proposed methodology.
