Article
Peer-Review Record

Facilitating Robot Learning in Virtual Environments: A Deep Reinforcement Learning Framework

Appl. Sci. 2025, 15(9), 5016; https://doi.org/10.3390/app15095016
by Algirdas Laukaitis *, Andrej Šareiko and Dalius Mažeika
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 30 March 2025 / Revised: 25 April 2025 / Accepted: 29 April 2025 / Published: 30 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This article introduces a framework for applying deep reinforcement learning (DRL) in the Webots simulation environment, aimed at facilitating the transfer of trained models to physical robots. It presents three design patterns, a digital twin methodology, and experiments. However, the manuscript would benefit from improved clarity in presenting its novelty and more rigorous comparisons to related work.

 Here are some comments:

  • The abstract is informative but should clearly state the novelty over prior frameworks (e.g., Deepbots, Gym-Ignition).
  • Clarify how the proposed interface for Webots improves upon Deepbots and existing Gym wrappers.
  • Was domain randomization used during training? If so, provide metrics that show its impact on policy robustness.
  • The references are recent and relevant, though more citations are needed for the RL transfer learning and domain randomization claims.
  • The work is promising and technically sound but requires clearer presentation of novelty, more experimental rigor, and improved language quality.

Author Response

Dear Reviewer,

Thank you for your constructive feedback on our paper. We appreciate your valuable comments, particularly concerning the clarity of our framework's novelty and the need for more rigorous comparisons to related work.

1.

We agree that explicitly highlighting our contribution over existing frameworks is essential for clarity. To address this, we have revised the abstract and paper chapters to more clearly articulate the novelty of our proposed framework and method by directly referencing and differentiating it from related efforts such as Deepbots and Gym-Ignition.

Specifically, our revised paper now clarifies that while frameworks like Deepbots exist, they often present challenges related to complex external package dependencies and reliance on an unrealistic 'supervisor' program for environment control – issues our framework explicitly addresses through a novel pattern-based approach that simplifies setup and a methodological pattern that removes the supervisor dependency, making it more suitable for practical, real-world robot applications.

Furthermore, we differentiate our work from frameworks like Gym-Ignition by highlighting that our solution provides an actively maintained open-source option tailored to leverage the realistic physics and detailed robot modeling capabilities of simulators like Webots, in contrast to Gym-Ignition's current status of inactive maintenance and broader focus.

2.

Thank you for your suggestion to clarify the advantages of our proposed interface over existing solutions like Deepbots and Gym wrappers. We have revised the relevant Sections 2 and 3 to explicitly address this. Our proposed framework offers several key improvements:

Updated Standard and Dependency Management: Unlike Deepbots, which is built upon the older, deprecated OpenAI Gym library, our interface utilizes the current OpenAI Gymnasium standard. This ensures compatibility with modern reinforcement learning algorithms and libraries. Furthermore, Deepbots requires installing a separate Python package with dependencies that often conflict with newer libraries, complicating setup. Our framework avoids these specific package dependencies and the associated versioning issues.

Reduced Supervisor Dependency: A critical architectural difference is that our framework removes the need to rely heavily on the Webots 'Supervisor' node for the fundamental agent-environment interaction loop (observation, action, reward). While Deepbots and similar approaches often require intricate Supervisor logic, our method simplifies this, making the integration of RL agents more straightforward and potentially more generalizable across different simulation environments.

Lack of Comparable Alternatives: To our knowledge, Deepbots is the primary existing wrapper attempting a generic interface, but it suffers from the aforementioned limitations (outdated standard, package dependencies). We are not aware of other established, generic wrappers for Webots that leverage the current Gymnasium standard and offer the architectural simplification our framework provides.

By addressing these points, our framework offers a more standardized, compatible, and simplified approach for applying modern deep reinforcement learning techniques within Webots compared to previous methods like Deepbots. We hope this clarification adequately addresses your comment.
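To make the Gymnasium-based integration concrete, the sketch below outlines how a Webots robot controller could be exposed through the current Gymnasium Env interface without a Supervisor node in the loop. This is only an illustrative sketch: the class name, the placeholder sensor/actuator code, and the example spaces are assumptions and do not reproduce the framework's actual classes.

import gymnasium as gym
import numpy as np
from gymnasium import spaces
from controller import Robot  # Webots controller API

class WebotsRobotEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.robot = Robot()
        self.timestep = int(self.robot.getBasicTimeStep())
        # Example spaces; real bounds depend on the robot and the task.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def _observe(self):
        # Placeholder: read the robot's sensors and build the observation here.
        return np.zeros(4, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.robot.step(self.timestep)   # advance the simulation by one step
        return self._observe(), {}

    def step(self, action):
        # Placeholder: apply the chosen action to the robot's actuators here.
        self.robot.step(self.timestep)   # advance the simulation
        obs = self._observe()
        reward = 0.0                     # placeholder task-specific reward
        terminated = False               # placeholder termination condition
        return obs, reward, terminated, False, {}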

3.

Thank you for your insightful question regarding the use of domain randomization. Yes, domain randomization was utilized during the training phase to enhance the robustness of the learned policies and improve their potential for real-world deployment. We employed two main strategies for domain randomization within our framework, applied during the training of the RLRobot instances described in our third design pattern, which has been revised in the new version of this paper (see Figure 3):

Initial State Randomization: At the beginning of each training episode, random actions are applied for a variable, randomized duration. This ensures the agent experiences a wider variety of starting conditions.

Online Action Perturbation: Throughout the training process, random actions are probabilistically introduced at training steps. This is implemented via a specific method (d_randomize()) within the RLRobot class, which decides whether to inject noise based on a predefined probability.

These techniques aim to expose the learning agent to variations beyond the standard simulation dynamics, thereby encouraging the development of more generalizable and robust policies. The parameters controlling the randomization (e.g., the probability of perturbation and the range of random actions) are configurable, allowing robustness enhancement to be balanced against stable policy convergence.
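As a rough illustration of the two strategies above, the following sketch shows how an initial-state randomization phase and a probabilistic online action perturbation could be implemented. The class name, parameter names, and the d_randomize() signature shown here are assumptions for illustration only and are not the exact code of the RLRobot class.

import random

class RandomizedTrainingHelper:
    def __init__(self, action_space, perturb_prob=0.05, max_warmup_steps=50):
        self.action_space = action_space
        self.perturb_prob = perturb_prob          # probability of injecting a random action
        self.max_warmup_steps = max_warmup_steps  # upper bound on the random warm-up length

    def randomize_initial_state(self, env):
        # Initial state randomization: apply random actions for a randomized
        # duration at the start of each training episode.
        for _ in range(random.randint(0, self.max_warmup_steps)):
            env.step(self.action_space.sample())

    def d_randomize(self, action):
        # Online action perturbation: with a small probability, replace the
        # policy's action with a random one during a training step.
        if random.random() < self.perturb_prob:
            return self.action_space.sample()
        return action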

We have updated the manuscript (in the section describing the third RL design pattern) to explicitly detail this implementation of domain randomization. Regarding metrics showing its impact on policy robustness: While our current work focuses on presenting the framework and the implementation of domain randomization within it, we acknowledge that dedicated experiments quantifying the specific robustness gains (e.g., comparing performance with and without randomization under varying sim-to-real gap conditions) are crucial. We plan to conduct these quantitative evaluations as part of our future work to explicitly measure the effectiveness of the implemented randomization techniques.

4.

Thank you for your feedback regarding the citation coverage. We acknowledge the need for further support for our claims related to RL. In response to your comment, we have carefully reviewed the relevant literature and incorporated five additional, recent citations specifically addressing DRL. These new references have been integrated into the text where these concepts are discussed to better substantiate our discussion. Additionally, we have reviewed and slightly reorganized the existing references to improve overall clarity and flow. We believe these changes effectively address your note and strengthen the paper's connection to relevant prior work.

5.

Thank you for your constructive feedback. We have paid close attention to your comment regarding the manuscript's language quality and have undertaken a comprehensive revision to improve clarity and rectify all identified typographical errors.

We believe these revisions significantly enhance the manuscript's clarity regarding our framework's unique contributions and how it advances the state of the art in facilitating robot learning in realistic virtual environments.

Thank you again for your insightful feedback. We are confident that these changes have improved the manuscript.

Sincerely,

Authors of the paper

Reviewer 2 Report

Comments and Suggestions for Authors

The article “Facilitating Robot Learning in Virtual Environments: A Deep Reinforcement Learning Framework” presents a framework for applying reinforcement learning to virtual robots, including twin robots that perfectly imitate real robots. The tool allows modeling of the robot, the environment in which it is expected to work, and the task the robot should perform.

In general, the work is well presented and organized. Simulation results using two kinds of robots are presented, but the trained models are not deployed on real robots.

The following points should be corrected in the final version.

In the Introduction section, sentences like 1 and 2 below need reference to support.

1- “Reinforcement learning applications in virtual environments has already demonstrated significant potential across industries.”

2- “Robots trained via RL in simulation have been successfully deployed in manufacturing for precision assembly tasks, in healthcare for automated assistance, and even in space exploration for autonomous navigation.”

There are a lot of problems with the reference citations. The format is (Author, year), but it appears with or without the comma. When a reference has more than three authors, one should use the surname of the first author followed by “et al.”. For example:

  1. Kirtas, M., Tsampazis, K., Passalis, N., & Tefas, A. (2020). Deepbots: A webots-based deep reinforcement learning framework for robotics. In Artificial Intelligence Applications and Innovations: 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos Marmaras, Greece, June 5–7, 2020, Proceedings, Part II 16 (pp. 64-75). Springer International Publishing.

It should be cited as (Kirtas et al., 2020), but the authors cited it in the text as (Kirtas 2020) and (Kirtas et al. (2020)).

In some reference citations the authors used “et. al.”, but only “al.” is an abbreviation, so it should be written “et al.”.

The sentence below is missing a reference for CartPole.

“It provides a standardized interface for a diverse range of simulated environments, from simple tasks like CartPole to intricate robotic control simulations.”

The text is missing a description of the class RLModel, similar to the one presented for RLRobot.

“This pattern is composed of two key classes: RLRobot and RLModel. The RLRobot class defines the robots to be trained and is responsible for gathering observations from the environment, executing actions, and resetting the environment to its initial state once the robot either achieves its goal or encounters a failure during the learning process.”

Figures appear misaligned with the text.

Figure 1, Figure 2 and Figure 3 present diagrams where blocks are connected with others by oriented lines. Are these lines indicating direction of communication? What is the difference between the two kinds of arrows utilized?

About Figure 1 (right), it is not clear how the communication between these blocks works, since the nodes are connected using directional arcs in only one direction. As drawn, it is not possible for the RLModel to send information or commands to the SlaveRobot (or even to the SupervisorRobot).

According to Figure 2, there is no connection between RLModel and SlaveRobot; RLModel only receives information from the SupervisorRobot. It therefore seems impossible to realize what is described in the sentence below.

“The RLModel generates an action based on these observations, which is transmitted back to the SlaveRobot using the Emitter.”

Figure 3 is cut off at the top.

The sentence below needs a reference for TensorFlow Agents.

“The creation of a digital twin begins with designing an RL-compatible environment, following standard APIs like OpenAI Gym to ensure compatibility with RL libraries such as Stable-Baselines3 and TensorFlow Agents.”

Why is Table 1 not aligned with the text?

Section 5 (Example robots and experiment results) presents results of training the robots, not of deploying the trained models on real robots to carry out experiments.

Comments for author File: Comments.pdf

Author Response

Dear Reviewer,

Thank you for your constructive feedback. We have taken your comments into careful consideration.

1.

In response, we have revised the Introduction section to provide stronger support for the claims regarding the industrial potential of reinforcement learning applications in virtual environments and the successful deployment of simulation-trained robots. Specifically, we have added five additional references that substantiate the following points:

1. Reinforcement learning applications in virtual environments have already demonstrated significant potential across industries. The new references outline a diverse range of RL applications and their effectiveness in simulated settings, thus strengthening the statement.

2. Robots trained via RL in simulation have been successfully deployed in manufacturing for precision assembly tasks, in healthcare for automated assistance, and even in space exploration for autonomous navigation. The added citations provide substantial evidence from recent studies and industrial applications, illustrating the practical impact of simulation-trained robots across these sectors.

2.

We have carefully revised all references and have made the following corrections in accordance with the requirements of the Applied Sciences journal:

1. Consistent Citation Format: We have standardized the citation format.

2. Supplementary Reference for CartPole: We have added the missing reference for CartPole to the manuscript in support of the sentence: “It provides a standardized interface for a diverse range of simulated environments, from simple tasks like CartPole to intricate robotic control simulations.”

3.

We appreciate you pointing out the missing description for the RLModel class. We acknowledge this oversight and have revised the relevant section to include a detailed explanation of its role and responsibilities, mirroring the level of detail provided for the RLRobot class. Specifically, we have added text clarifying that RLModel functions as an interface for implementing various reinforcement learning algorithms within our framework. We have also detailed its core responsibilities, including the implementation of the key methods: learn() for training the model via simulation, predict() for evaluating the trained model and rendering its behavior, and save() for persisting the model.
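For illustration only, a minimal interface of this kind backed by Stable-Baselines3 might look like the sketch below; apart from the method names learn(), predict(), and save(), which are taken from the text, the class structure, constructor arguments, and default values are assumptions rather than the framework's actual implementation.

from stable_baselines3 import PPO

class RLModel:
    def __init__(self, env, policy="MlpPolicy"):
        self.env = env
        self.model = PPO(policy, env, verbose=1)   # any SB3 algorithm could be plugged in here

    def learn(self, total_timesteps=100_000):
        # Train the policy against the simulated environment.
        self.model.learn(total_timesteps=total_timesteps)

    def predict(self, episodes=5):
        # Evaluate the trained model by rolling out a few episodes.
        for _ in range(episodes):
            obs, _ = self.env.reset()
            done = False
            while not done:
                action, _ = self.model.predict(obs, deterministic=True)
                obs, _, terminated, truncated, _ = self.env.step(action)
                done = terminated or truncated

    def save(self, path="rl_model"):
        # Persist the trained model to disk.
        self.model.save(path)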

4.

Thank you for your insightful feedback regarding our figures and their corresponding explanations. We have thoroughly revised both the diagrams and the related text to address your concerns:

Figure Alignment and Layout: We have corrected the alignment issues across Figures 1, 2, and 3. In particular, Figure 3 has been updated to ensure that no part of the diagram is cut off.

Clarification of Arrow Conventions in UML Diagrams: In Figures 2 and 3, the oriented lines have been employed to represent two types of relationships as per UML notation. Specifically, one type of arrow indicates an inheritance relationship, while the other denotes a usage dependency. The latter illustrates that one class explicitly creates or utilizes an instance of another class. We have clarified these distinctions in the revised text to ensure that it is evident what each arrow represents.

Communication Flow Between Components: Concerning the apparent absence of a direct connection between the RLModel and the SlaveRobot, this is intentional. The communication between these two components is mediated via the SupervisorRobot in accordance with the Webots design. This intermediary role of the SupervisorRobot is essential, as it aligns with the architectural constraints and communication protocols dictated by the environment. Thus, while the diagram does not display a direct link from the RLModel to the SlaveRobot, the action generated by the RLModel is transmitted to the SlaveRobot through the SupervisorRobot using the Emitter.
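To make this mediated path concrete, the sketch below shows a possible SupervisorRobot controller loop that relays observations arriving from the SlaveRobot and sends the RLModel's action back through the Emitter. The device names ("emitter", "receiver"), the plain-text message format, and the choose_action() placeholder are assumptions used purely for illustration, not the framework's actual controller code.

from controller import Supervisor

def choose_action(observation):
    # Placeholder standing in for a call into the trained RLModel.
    return 0

supervisor = Supervisor()
timestep = int(supervisor.getBasicTimeStep())
emitter = supervisor.getDevice("emitter")     # sends actions to the SlaveRobot
receiver = supervisor.getDevice("receiver")   # receives observations from the SlaveRobot
receiver.enable(timestep)

while supervisor.step(timestep) != -1:
    while receiver.getQueueLength() > 0:
        observation = receiver.getData()      # raw payload sent by the SlaveRobot
        receiver.nextPacket()
        action = choose_action(observation)
        emitter.send(str(action).encode("utf-8"))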

We believe these revisions enhance the credibility of our claims and align the manuscript with the journal’s standards.

Sincerely,

Authors of the paper

Reviewer 3 Report

Comments and Suggestions for Authors

The introduction presents some benefits of simulation environments and digital twins but does not discuss any disadvantages - e.g. time/cost needed for the creation of the environment/models, dangers of omitting important details in the models, time/care needed to adapt to changes, etc.

Is UML used in Fig. 2/4? Please, specify the modelling language and the type of diagram (e.g. "class diagram").

Some typos should be corrected, e.g. "Following A significant drawback".

If I understood correctly, the "slave robot/robot" in the proposed models does not access directly the RLModel. Does this mean that in a real-world implementation, the RLModel is always accessed remotely (via emitter/receiver) and never runs on the embedded system in the "slave robot/robot"?

In Section 5, I do not see any direct application/discussion of the models presented in Fig. 2/3.

Overall, I liked the paper as it gives a good and understandable broad overview of the topic. The provided open source materials are also useful. Nevertheless, I think that it would be beneficial to connect the various paper sections more closely to one another, if possible. In addition, please give some more details about the simulation setup, e.g. hardware parameters, software libraries/modules/versions and settings of the environment.

One important question that interests me in particular concerns the computing resources needed to train/use the proposed RL model, e.g. can the model run on the robot itself (e.g. on a Cortex-A/Cortex-M MCU)?

Author Response

Dear Reviewer,

Thank you for your thorough review and constructive feedback on our manuscript. We appreciate your positive comments regarding the overview and open-source materials, as well as your insightful suggestions for improvement. We have carefully considered each point and have revised the manuscript accordingly.

Discussion of Simulation Disadvantages: Thank you for pointing out the need for a balanced perspective. We have revised the Introduction to include a discussion on the potential disadvantages of simulation environments, such as the time and cost associated with creating accurate models, the risks of the sim-to-real gap if crucial details are omitted, and the effort required for adaptation to changes.

Clarification of Diagrams (Figs 2/4): We apologize for the lack of clarity. We have updated the captions and relevant text for these figures to explicitly state that they are UML class and sequence diagrams.

We have also clarified the conventions used for the arrows within these diagrams, specifying that they represent inheritance relationships or usage dependencies (where one class utilizes an instance of another), consistent with standard UML notation.

Typo Correction: Thank you for catching the typo. We have conducted a careful proofread of the entire manuscript to address any other typographical errors, aiming to improve overall language quality as also suggested by other reviewers.

RLModel Access and Real-World Implementation: Your understanding is correct. In the proposed architectural pattern leveraging Webots' communication mechanisms (Emitter/Receiver, often managed via a Supervisor controller), the RLModel typically runs as a separate process (e.g., on the host PC running the simulation) and communicates remotely with the RLRobot (or its controller). This means it is not intended to run directly on the simulated robot's embedded system in this configuration. This indirect communication path via the Supervisor/Emitter is inherent to the Webots architecture we utilized. The computational requirements discussed below further highlight the current challenges of running such models directly on resource-constrained embedded hardware.

Connection of Section 5 to Figs 2/3: We appreciate you highlighting the need for a clearer connection. The experiments and case studies presented in Section 5 were indeed implemented based directly on the architectural patterns detailed in Figures 2 and 3. The interaction between RLRobot and RLModel forms the core of the experimental setup. To make this explicit, we have added brief references within Section 5 to clearly link the methodology used back to the architectural diagrams in Figures 2 and 3.

Simulation Setup Details: Thank you for this question. Hardware: the primary machine used for training and testing was an Intel i5-14500 CPU with 32 GB RAM. Software: the key software used includes Webots, Python, OpenAI Gymnasium, and Stable-Baselines3.

Computing Resources for Training/Use: Training was conducted on a standard PC (Intel i5-14500, 32 GB RAM, without significant GPU acceleration observed). Training time varied depending on the complexity of the task: up to approximately 5 hours for the first robot scenario and up to 2 days for the second, more complex CartPole balancing. We also tested training on an Nvidia 3080 GPU, but the performance improvement was not significant. We hypothesize this might be due to the Webots simulation rendering each step for observation generation, potentially creating a bottleneck irrespective of GPU compute power for the learning algorithm itself. This aspect requires further investigation in future work, potentially exploring headless rendering modes if feasible within the Webots framework's philosophy of realistic simulation.

We hope these revisions and clarifications adequately address your concerns and improve the quality and clarity of the manuscript. We appreciate the time and effort you dedicated to reviewing our paper.

Sincerely,

Authors of the paper

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have addressed all comments, thank you.
