1. Introduction
Reinforcement Learning (RL) in robotics holds promise for providing adaptive robotic behaviors in scenarios where traditional programming methods fall short. However, applying RL in the real world still faces numerous challenges, including (1) the reality gap [
1] between simulation and physical robots, (2) the real-time delay between observation and action, and (3) the difficulty of scaling to multi-robot or multi-task settings. While these challenges often appear independently in the literature, they require a unified solution to make RL practical for robotic systems. A unified solution (framework) to these challenges can be built using the Robot Operating System, as it primarily unifies message types for sensory-motor interfaces, offers microsecond timing functionalities, and is available for most current commercial and research platforms.
Reality Gap: Robot-based reinforcement learning [
2,
3] usually depends on simulation models for learning robotic applications and then transferring the learned knowledge to real-world robots. This transfer stage remains a major bottleneck because most simulation frameworks offer little guidance on moving learned behaviors from simulation to real robots. One of the main challenges is that currently available robotics simulators cannot fully capture the varying dynamics and intrinsic parameters of the real world. Therefore, agents trained in simulation typically cannot be generalized directly to the real world due to the domain gap (reality gap) introduced by the discrepancies and inaccuracies of the simulators. To overcome this issue, experimenters must add further steps to the learning task, such as incorporating real-world learning [
4] and applying Sim-to-real [
5] or domain adaptation [
6] techniques to transfer the learned policies from simulation to the real world.
Real-time mismatch: Even after addressing these concerns, a key challenge in real-world robotic learning is managing sensorimotor data in the context of real-time scenarios [
7]. In robotic RL, ‘real-time’ refers to the ability of the environment to operate at a pace where the robot’s decision-making and execution of actions must occur within a specific time frame. This rapid pace is essential for the robot to interact effectively with its environment, ensuring that the processing of sensory data and the execution of actuator responses are both timely and accurate. This aspect is particularly critical when creating simulation-based learning tasks to transfer learning to real-world robots. Currently, in most simulation-based learning tasks, computations related to environment-agent interactions are typically performed sequentially. Therefore, to comply with the Markov Decision Process (MDP) architecture [
8], which assumes no delay between observation and action, most simulation frameworks pause the simulation to construct the observations, rewards, and other computations. In contrast, time advances continuously between agent- and environment-related interactions in the real world. Hence, learning is typically performed with delayed sensorimotor information, potentially impacting the synchronization and effectiveness of the agent’s learning process in real-world settings [
9]. These turn-based systems therefore do not mirror the continuous and dynamic nature of real-world interactions and can cause the timing of sensorimotor events to diverge from what occurs on the physical system. The resulting issues stem from the agent receiving outdated information about the state of the environment and the robot not receiving timely actuation commands to execute the task.
Multi-Robot/Task Learning: In modern RL research, there is a growing interest in leveraging knowledge from multiple RL environments instead of training standalone models. One of the advantages of this approach is that it can improve the agent’s learning by generalizing knowledge across different scenarios (domains or tasks) [
10]. Furthermore, combining concurrent environments with diverse sampling strategies can effectively accelerate the agent’s learning process [
11]. This leveraging process can expose the agent to learning multiple tasks simultaneously rather than learning each task individually (multi-task learning) [
12]. This is also similar to meta-learning [
13]-based RL applications, where the agent can quickly adapt and acquire new skills in new environments by leveraging prior knowledge and experiences through learning-to-learn approaches. Another advantage of concurrent environments is scalability, which allows for the simultaneous training of multiple robots in parallel, either in a vectorizing fashion or for different tasks or domain learning applications [
14]. Therefore, creating concurrent environments is crucial for efficiently utilizing computing resources to accelerate learning in real-world applications, where multiple robots must be trained and deployed efficiently. While several solutions such as SenseAct [
9,
15] exist in the literature for real-time RL-based robot learning, they predominantly focus on single-robot scenarios or systems comprising robots from the same manufacturer, limiting their applicability in heterogeneous multi-robot settings [
16]. Furthermore, they often overlook the computational challenges inherent in scaling to multiple robots, particularly the CPU bottlenecks that can arise from processing data from various sensors, such as vision systems that may require CPU-intensive preprocessing operations [
17,
18].
Programming Interface Fragmentation: Another challenge is the programming language gap between simulation frameworks and real-world robots from different manufacturers. Most current simulation frameworks used in RL are commonly implemented in languages like Python, C#, or C++. However, real robots typically have proprietary programming languages, such as RAPID, Karel, and URScript, or may utilize the Robot Operating System (ROS) for communication and control. Therefore, it is not possible to transfer the learned knowledge directly without recreating the RL environment in the recommended robot programming language to communicate with the physical hardware [
9]. Furthermore, this challenge also applies when learning needs to occur directly in the real world without relying on knowledge transferred from a digital model. Such cases include dealing with liquids, soft fabrics, or granular materials, whose physical properties are challenging to model precisely in simulation [
19]. In these scenarios, the experimenters must establish a communication interface with the physical robots to enable the agent to directly interact with the real world. This process becomes more challenging if the task requires robots from multiple manufacturers, as they typically do not share a common programming language.
Fortunately, the Robot Operating System (ROS) presents a promising solution to some of these challenges. This is because ROS is widely acknowledged as the standard for programming real robots, and it receives massive support from manufacturers and the robotics community. This makes it an ideal platform for constructing learning tasks applicable to simulations and real-world settings. Currently, numerous simulation frameworks are available for creating RL environments using ROS, with most prioritizing simulation over real-world applications. A fundamental limitation of these simulation frameworks, such as OpenAI_ROS (
http://wiki.ros.org/openai_ros, accessed on 17 June 2025), gym-gazebo [
20], ros-gazebo-gym (
https://github.com/rickstaa/ros-gazebo-gym, accessed on 17 June 2025), and FRobs_RL [
21], is their inability to create real-time RL simulation environments, owing to their turn-based learning approaches. As a result, the full potential of ROS for setting up learning tasks that transfer easily to the real world is not exploited. Furthermore, with the current offerings, ROS lacks Python bindings for some crucial system-level features needed to create RL environments, such as launching multiple roscores, nodes, and launch files, which are currently confined to manual Command Line Interface (CLI) configurations. Moreover, the potential of ROS for building real-time RL environments with precise time synchronization, which is essential for aligning sensor data acquisition, decision-making, and actuator responses and thereby reducing latency in agent-environment interactions, has not yet been thoroughly studied. Addressing these gaps could further streamline the development of effective and efficient RL environments for robots.
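To illustrate the kind of Python-level system bindings referred to here, the sketch below wraps the standard `roscore` and `rosrun` CLI tools with `subprocess` and environment variables so that additional ROS masters and nodes can be started programmatically. The helper names, the port, and the example node are illustrative assumptions and do not reflect UniROS's actual API.

```python
import os
import subprocess

def launch_roscore(port=11312):
    """Start an additional roscore on a custom port (illustrative helper).
    Returns the process handle and the corresponding master URI."""
    env = os.environ.copy()
    env["ROS_MASTER_URI"] = f"http://localhost:{port}"
    proc = subprocess.Popen(["roscore", "-p", str(port)], env=env)
    return proc, env["ROS_MASTER_URI"]

def launch_node(package, executable, master_uri, args=()):
    """Start a ROS node under a specific master via the rosrun CLI."""
    env = os.environ.copy()
    env["ROS_MASTER_URI"] = master_uri
    return subprocess.Popen(["rosrun", package, executable, *args], env=env)

if __name__ == "__main__":
    # Second ROS master for a second, independent RL environment.
    core, uri = launch_roscore(11312)
    node = launch_node("rospy_tutorials", "talker", uri)
```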
Therefore, this study addresses the central question of “How to design an ROS-based reinforcement learning framework that supports both simulation and real-world environments, real-time execution, and concurrent training across multiple robots or tasks”. This paper presents a comprehensive framework designed to create RL environments that cater to both simulation and real-world applications. This includes adding support for ROS-based concurrent environment creation, a requirement for multi-robot/task learning techniques, such as multi-task and meta-learning, which enables the simultaneous handling of learning across multiple simulated and/or real RL environments. Furthermore, this study explores how this framework can be utilized to create real-time RL environments by leveraging an ROS-centric environment implementation strategy that bridges the gap between transferring learning from simulation to the real world. This aspect is vital for ensuring reduced latency in agent-environment interactions, which is crucial for the success of real-time applications.
Furthermore, this study introduces benchmark learning tasks to evaluate and demonstrate some use cases of the proposed approach. These learning tasks are built around the ReactorX200 (Rx200) robot by Trossen Robotics and the NED2 robot by Niryo and are used to explain the design choices. This study also lays the groundwork for multi-robot/task learning techniques, allowing for the sampling of experiences from multiple concurrent environments, whether they are simulated, real, or a combination of both.
Summary of Contributions:
Unified RL Framework: Development of a comprehensive, ROS-based framework (UniROS) for creating reinforcement learning environments that work seamlessly across simulation and real-world settings.
Concurrent Env Learning Support: Enhancement of the framework to support vectorized [
22] multi-robot/task learning techniques, enabling efficient learning across multiple environments by abstracting standard ROS communication tools into a reusable structure tailored for RL.
Real-Time Capabilities: Introduction of a ROS-centric implementation strategy for real-time RL environments, ensuring reduced latency and synchronized agent-environment interactions.
Benchmarking and Evaluation: Empirical demonstration through benchmark learning tasks, addressing these challenges using the proposed framework in three distinct scenarios.
3. Related Work
Most RL-based simulation frameworks for robots are built on simulators such as MuJoCo [
38], PyBullet [
39], and Gazebo [
40], which prioritize accelerated simulations for developing complex robotic behaviors, often with less emphasis on the seamless transition of policies to real-world robots. A recent advancement in this field is Orbit [
41] (Now Isaac Lab), a framework built upon Nvidia’s Isaac Gym [
42] to provide a comprehensive modular environment for robot learning in photorealistic scenes. It is distinguished by its extensive library of benchmarking tasks and capabilities that potentially ease policy transfer to physical robots with ROS integration. However, at the current stage, its focus remains mainly on simulation rather than direct real-world learning. Although it provides tools for simulated training and real-world applications, it may not yet serve as a complete solution for real-world robotics learning without additional customization and system integration efforts. Furthermore, the high hardware requirements (
https://docs.isaacsim.omniverse.nvidia.com/latest/installation/requirements.html, accessed on 17 June 2025) of Isaac Sim may restrict accessibility for many researchers and roboticists, limiting its widespread adoption.
SenseAct [
9] is a notable contribution that highlights the challenges of real-time interactions with the physical world and the importance of sensor-actuator cycles in realistic settings. They proposed a computational model that utilizes multiprocessing and threading to perform asynchronous computations between the agent and the real environment, aiming to minimize the delay between observing and acting. However, this design is primarily tailored for single-task environments and shows limitations when extended to multi-robot/task research, including learning together with simulation frameworks or concurrently in multiple environments. This limitation partly stems from its architecture, which allocates a single process with separate threads for agents and environments. The scalability of this approach, particularly for concurrent learning with multiple RL environments, is hindered by Python’s Global Interpreter Lock (GIL) (
https://wiki.python.org/moin/GlobalInterpreterLock, accessed on 17 June 2025), which restricts parallel execution of CPU-intensive tasks. Hence, incorporating multiple RL environment instances within a single process is not computationally efficient, especially when real-time interactions are critical. Furthermore, the difficulty in synchronizing different processes and establishing communication layers with various robots and sensors from different manufacturers may limit the potential of their proposed approach.
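The process-per-environment pattern implied by this discussion can be illustrated with a generic sketch (not SenseAct's or UniROS's actual code): each environment instance owns its own Python process, so CPU-heavy observation preprocessing is not serialized by the GIL. Reset handling after episode termination is omitted for brevity.

```python
import multiprocessing as mp

def env_worker(conn, make_env):
    """Run one environment in its own process and serve step requests."""
    env = make_env()
    conn.send(env.reset())
    while True:
        action = conn.recv()
        if action is None:              # shutdown signal
            break
        conn.send(env.step(action))     # (obs, reward, done, info)

class ProcessEnv:
    """Minimal process-backed environment handle (sketch only)."""

    def __init__(self, make_env):
        self.parent, child = mp.Pipe()
        self.proc = mp.Process(target=env_worker, args=(child, make_env), daemon=True)
        self.proc.start()
        self.initial_obs = self.parent.recv()

    def step(self, action):
        self.parent.send(action)
        return self.parent.recv()

    def close(self):
        self.parent.send(None)
        self.proc.join()
```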
Table 1 provides a comprehensive comparison between UniROS and existing RL frameworks, focusing on ROS integration, real-time capabilities, and multi-robot support. Unlike most prior tools, which are either simulation-centric or designed for single-robot real-world use, UniROS is uniquely positioned to support scalable and low-latency training across both simulation and physical robots concurrently.
In addition to comprehensive frameworks, several studies have addressed specific aspects of bridging simulation and real-world robot learning. Many domain randomization approaches [
43,
44] either dynamically adjust the simulation parameters based on real-world data or vary the simulation parameters to improve the sim-to-real transfer. However, their methods often require extensive manual tuning of randomization ranges and do not address the fundamental timing mismatches between simulations and real-world execution. While other domain adaptation approaches [
45,
46] leverage demonstrations in both simulation and real-world settings to accelerate robot learning, these approaches require separate implementations for each domain. As they do not provide a unified interface for concurrent learning across simulated and real environments, both lines of work highlight the need for more efficient frameworks that can leverage simulated and real-world data concurrently.
8. Benchmark Tasks Creation
This section discusses the development of benchmark tasks using both MultiROS and RealROS packages. These simulated and real environments are then used in the subsequent sections to explain and evaluate the proposed real-time environment implementation strategy and to demonstrate some use cases of the UniROS framework. These tasks are modeled closely after the Reach task of the OpenAI Gym Fetch robotics environments [
51], where an agent learns to reach random target positions in 3D space. In each Reach task, the robot’s initial pose is the default “home” position (typically zero for all joint angles), and the agent’s goal is to move the end-effector to a target position to complete the task. Each task therefore generates a random 3D point as the target at every environment reset, and the task is completed when the end-effector reaches the goal within a Euclidean distance of ε, where ε (the reach tolerance) is set to 0.02 m. However, unlike the Fetch environments, where the action space represents the Cartesian displacement of the end-effector, these tasks use the joint positions of the robot arm as actions. This choice was motivated by the fact that joint position control typically aligns better with realistic robot manipulation, offering enhanced precision and a simpler action space. The following describes the ReactorX 200 and NED2 robots and the creation of the Reacher tasks (Rx200 Reacher and Ned2 Reacher).
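As a concrete illustration of the termination condition described above, the sketch below computes sparse and dense feedback for the Reach task. Variable names are illustrative assumptions; the exact reward definitions used by the benchmark tasks are given in Appendix A.

```python
import numpy as np

REACH_TOLERANCE = 0.02  # meters, as stated in the text

def reach_task_feedback(ee_position, goal_position, dense=True):
    """Return (reward, done) for a Reach task of the kind sketched above.
    ee_position / goal_position: 3D points expressed in the robot base frame."""
    distance = np.linalg.norm(np.asarray(ee_position) - np.asarray(goal_position))
    done = bool(distance <= REACH_TOLERANCE)
    if dense:
        reward = -distance                  # dense variant: negative Euclidean distance
    else:
        reward = 0.0 if done else -1.0      # sparse variant, as used with HER
    return reward, done
```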
The ReactorX 200 robot arm (Rx200) by Trossen Robotics is a five-degree-of-freedom (5-DOF) arm with a 550 mm reach. It operates moderately at a hardware control loop frequency of 10 Hz. This compact robotic manipulator is most suitable for research work and natively supports ROS Noetic (
http://wiki.ros.org/noetic, accessed on 17 June 2025) without requiring additional configuration or setup. It connects directly to a PC via a USB cable for communication and control, providing a reliable and straightforward method of connectivity (using the default ROS master port). All the necessary packages for controlling the Rx200 using ROS are currently available from the manufacturer as public repositories on GitHub (
https://github.com/Interbotix/interbotix_ros_manipulators, accessed on 17 June 2025). Similarly, the NED2 robot by Niryo is also designed for research work and features six degrees of freedom (6 DOF) with a 490 mm reach. It has a slightly higher control loop frequency of 25 Hz and natively runs ROS Melodic (
https://wiki.ros.org/melodic, accessed on 17 June 2025) on an enclosed Raspberry Pi. Niryo offers three communication options for connecting the NED2, including a Wi-Fi hotspot, direct Ethernet, or connecting both devices to the same local network. As SSH-based access was not desirable, this study opted for a direct Ethernet connection and utilized the ROS multi-device mode, as described in
Section 6.2, to ensure a robust communication setup. Furthermore, Niryo also provides the necessary ROS packages (
https://github.com/NiryoRobotics/ned_ros, accessed on 17 June 2025) to be installed on the local system, enabling custom messaging and service interfaces to access and control the remote robot through ROS.
Since two variants of the
Base Env (standard and goal-conditioned) are available in this framework, two types of RL environments were created for both simulation and the real world. These simulated and real environments involve continuous actions and observations and support both sparse and dense reward architectures. In the goal-conditioned environments, the agent receives observations as a Python dictionary containing the regular observation, the achieved goal, and the desired goal. The achieved goal is the current 3D position of the end-effector, obtained using forward kinematics (FK) calculations, and the desired goal is the randomly generated 3D target. One of the decisions made during task creation was to include the previous action as part of the observation vector, as this can minimize the adverse effects of delays on the learning process [
52]. Additionally, the observation vector includes the position of the end-effector with respect to the base of the robot, the current joint angles of the robot, the Cartesian displacement, and the Euclidean distance between the EE position and the goal. Additional experimental information, including details on actions, observations, and reward architecture, is provided in
Appendix A.
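A minimal sketch of how such a dictionary observation might be assembled is shown below. The field ordering and helper names are illustrative assumptions; the exact layout used in the benchmark tasks is documented in Appendix A.

```python
import numpy as np

def build_goal_conditioned_obs(ee_pos, goal_pos, joint_angles, prev_action):
    """Assemble the goal-conditioned observation dictionary described above."""
    ee_pos = np.asarray(ee_pos, dtype=np.float32)
    goal_pos = np.asarray(goal_pos, dtype=np.float32)
    displacement = goal_pos - ee_pos                     # Cartesian displacement to goal
    distance = np.linalg.norm(displacement)              # Euclidean distance to goal
    observation = np.concatenate([
        np.asarray(prev_action, dtype=np.float32),       # previous action (delay mitigation)
        ee_pos,                                          # end-effector position (base frame)
        np.asarray(joint_angles, dtype=np.float32),      # current joint angles
        displacement,
        np.array([distance], dtype=np.float32),
    ])
    return {
        "observation": observation,
        "achieved_goal": ee_pos,                         # obtained from FK
        "desired_goal": goal_pos,                        # randomly generated 3D target
    }
```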
Furthermore, specific constraints were imposed on the operational range of both types of environments to ensure safe operation of the robot and prevent harm to itself or its surroundings. One of these steps limits the goal space so that the robot cannot sample negative values in the z-direction of the 3D space. This is vital because the robot is mounted on a flat surface, making it impossible to reach locations below it. Additionally, before the agent's actions are executed on the robot, the environment checks for potential self-collisions and verifies whether the action would move the robot toward a position in the negative z-direction. The forward kinematics are therefore calculated from the received actions before executing them, avoiding unfavorable trajectories and keeping the robot within a safe 3D workspace. Hence, considering the complexity of the tasks and compensating for the gripper link lengths, the goal space was meticulously bounded by fixed maximum and minimum 3D coordinates (in meters) that are identical for both robots.
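The pre-execution workspace check described above can be sketched as follows, where `fk()` stands in for the framework's forward kinematics routine and the numeric goal-space bounds are left as placeholders rather than repeated here. The helper names are assumptions for illustration only.

```python
import numpy as np

GOAL_MIN = None  # placeholder for the minimum 3D workspace bounds (meters)
GOAL_MAX = None  # placeholder for the maximum 3D workspace bounds (meters)

def is_action_safe(joint_positions, fk, z_min=0.0):
    """Reject joint-position actions whose forward kinematics would drive any
    link below the mounting surface (negative z) or outside the goal space.
    `fk` is assumed to map joint positions to a list of 3D link positions."""
    link_positions = np.atleast_2d(fk(joint_positions))  # placeholder FK call
    for pos in link_positions:
        if pos[2] < z_min:                                # below the table
            return False
        if GOAL_MIN is not None and GOAL_MAX is not None:
            if np.any(pos < GOAL_MIN) or np.any(pos > GOAL_MAX):
                return False
    return True
```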
As for the learning agents of the experiments in this study, the vanilla TD3 was used for standard-type environments and TD3 + HER for goal-conditioned environments. TD3 is an off-policy RL algorithm that can only be used in environments with continuous action spaces. It was introduced to curb the overestimation bias and other shortcomings of the Deep Deterministic Policy Gradient (DDPG) algorithm [
53]. Here, TD3 was extended by combining it with Hindsight Experience Replay (HER) [
54], which encourages better exploration of goal-conditioned environments with sparse rewards. By incorporating HER, TD3 + HER improves the sample efficiency because HER utilizes unsuccessful trajectories and adapts them into learning experiences. This study implemented these algorithms using custom TD3 + HER implementations and the Stable Baselines3 (SB3) library, adding ROS support to facilitate their integration into the UniROS framework. The source code and supporting utilities are available on GitHub (
https://github.com/ncbdrck/sb3_ros_support, accessed on 17 June 2025), allowing other researchers and developers to leverage and build on this work. Detailed information on the RL hyperparameters used in the experiments is summarized in
Appendix B. Furthermore, all computations during the experiments were conducted on a PC with an Nvidia 3080 GPU (10 GB VRAM) and an Intel i7-12700 processor with 64 GB DDR4 RAM.
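For reference, the sketch below shows how TD3 + HER is typically configured with Stable Baselines3 for a goal-conditioned environment of this kind. The `env` variable is assumed to be a dictionary-observation Reacher environment created with the framework; the hyperparameter values here are placeholders (the actual values are listed in Appendix B), and depending on the SB3 version, additional `HerReplayBuffer` arguments (e.g., `max_episode_length`) may be required.

```python
from stable_baselines3 import TD3, HerReplayBuffer

model = TD3(
    "MultiInputPolicy",                        # required for dict observations
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,                      # HER relabelling ratio (illustrative)
        goal_selection_strategy="future",
    ),
    verbose=1,
)
model.learn(total_timesteps=30_000)            # step budget is illustrative
model.save("td3_her_reacher")
```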
9. Evaluation and Discussion of the Real-Time Environment Implementation Strategy
This section examines the intricacies of the proposed ROS-based real-time RL environment implementation strategy, utilizing benchmark environments as an experimental setup. The primary goal here is to discuss the two main hyperparameters of the proposed implementation strategy and gain an understanding of how to select suitable values for them, as they largely depend on the hardware capabilities of the robot(s) used in the learning task. Initially, experiments were conducted to investigate different action cycle times and environment loop rates to uncover the intricate balance between control precision and learning efficiency. Subsequently, the exploration was extended to include an empirical evaluation of asynchronous scheduling within the proposed environment implementation strategy. This process involves a thorough analysis of the time taken for each action and actuator command cycle across numerous episodes.
9.1. Impact of Action Cycle Time on Learning
In the real-time RL environment implementation strategy, the action cycle time (step size) is a crucial hyperparameter that determines the duration between two subsequent actions from the agent. The selection of this duration impacts the learning performance due to the use of action repeats in the environment loop. Action repeats ensure that the robot can perform smooth and continuous motion over a given period, especially when the action cycle time is longer. This technique helps to stabilize the robot’s movements and maintain consistent interaction with the environment between successive actions.
Selecting a shorter action cycle time, close to the environmental loop rate, would reduce the reliance on action repeats and enable faster data sampling from the environment due to more frequent agent-environment interactions. This would allow the agent to have finer control (high precision) over the environment at the cost of observing minimal changes. Such minimal changes can adversely affect training, as the agent may not perceive significant variations in the observations necessary for effectively updating deep neural network-based policies such as TD3. Conversely, selecting a longer action cycle time could lead to more action repeats and substantial changes in observations between successive actions, potentially easing and enhancing the learning process for the agent. However, this comes at the risk of reduced control precision and potentially slower reaction times to environmental changes, which can be detrimental in highly dynamic environments. Furthermore, this could potentially slow down the agent’s data collection rate, leading to a longer training time.
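One possible realization of the action-repeat mechanism inside the environment loop is sketched below using a `rospy` timer, whose callback runs on its own thread and keeps republishing the latest action at the environment loop rate until the agent supplies a new one. The topic and message type are placeholders, and this is not the exact UniROS implementation.

```python
import rospy
from std_msgs.msg import Float64MultiArray

class EnvironmentLoop:
    """Republish the latest agent action at the environment loop rate
    until a new action arrives (action repeat). Sketch only."""

    def __init__(self, command_topic, loop_rate_hz=10.0):
        self._pub = rospy.Publisher(command_topic, Float64MultiArray, queue_size=1)
        self._latest_action = None
        # rospy.Timer callbacks run on a separate thread, so publishing
        # continues while the agent computes its next action.
        self._timer = rospy.Timer(rospy.Duration(1.0 / loop_rate_hz), self._tick)

    def set_action(self, action):
        """Called from env.step(); the action is repeated at every loop tick."""
        self._latest_action = list(action)

    def _tick(self, _event):
        if self._latest_action is not None:
            self._pub.publish(Float64MultiArray(data=self._latest_action))
```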
Therefore, to study the effect of action cycle time on learning, experiments were conducted with multiple durations, selecting a baseline, and comparing the effects of longer action cycle times, as depicted in
Figure 6. These experiments were conducted in the real-world Rx200 Reacher task using the same initial values and conditions and employing the vanilla TD3 algorithm. This figure contains three graphs that illustrate the learning curves of the training process for all selected action cycle times.
Figure 6a shows the mean episode length, which represents the mean number of interactions that the agent has with the environment while attempting to achieve the goal within an episode. Ideally, the episode length should be shortened over time as the agent learns the optimal way to behave in its environment. Similarly,
Figure 6b depicts the mean total reward obtained per episode during training. The agent’s goal is to maximize this reward by improving its policy and learning to complete the task efficiently. Furthermore, in the benchmark tasks, the maximum allowed number of steps per episode is set to 100, providing a maximum of 100 agent-environment interactions to achieve the task. Exceeding this limit resulted in an episode reset and failure to complete the task. These failure conditions and successful task completion conditions are used to illustrate the success rate curve in
Figure 6c (refer to
Appendix A for more information).
In the experiments, the baseline was set at 100 ms, matching the period of the Rx200 robot's 10 Hz hardware control loop, which is also used as the default environment loop rate of the Rx200 Reacher benchmark task. This represents the shortest usable action cycle time for this benchmark task, because the Rx200 robot does not function properly below this duration. The training was then repeated to obtain learning curves for action cycle times of 200, 400, 600, 800, 1000, and 1400 ms. Each run of the experiment was conducted for 30,000 steps, allowing sufficient time to find the optimal policy. However, the data points illustrated in
Figure 6 were smoothed using a rolling mean with a window size of 10, plotted every 10th step, and shortened to the first 15K steps to improve readability.
As shown in
Figure 6, increasing the action cycle time can improve performance up to a certain point compared to using the same time duration as the environment loop rate (100 ms). For this benchmark task, the learning curves for action cycle times of 600 ms and 800 ms showed the best performance, quickly stabilizing with shorter episode lengths, higher total rewards, and higher success rates. This improvement can be attributed to the balance between sufficient observation changes and the agent’s ability to interact with the environment effectively. However, as the action cycle time increased beyond 800 ms, the performance started to degrade, as the learning curves for action cycle times of 1000 ms and 1400 ms required a larger number of steps to stabilize to an optimal policy. This decline in performance is likely due to the agent receiving less frequent updates, which introduces potentially more significant errors in the policy updates, causing the agent to struggle to maintain optimal behavior.
Overall, the experiments demonstrate that while increasing the action cycle time can initially improve learning by providing more substantial observation changes, there is a threshold beyond which further increases become detrimental. Therefore, the choice of action cycle time and the use of action repeats must be balanced based on the specific requirements of the task and the capabilities of the robot. Fine-tuning these parameters is crucial for optimizing learning performance and ensuring robust real-time agent-environment interactions.
9.2. Impact of Environment Loop Rate on Learning
To assess the impact of various environment loop rates, the same learning process was repeated using rates of 1, 5, 10, 20, 50, and 100 Hz. Here, the action cycle time was set for each run to match the environment loop rate to simplify the task by eliminating the action repeats. Furthermore, the baseline for these experiments was set to 10 Hz to align with the hardware control loop frequency of the Rx200 robot used in the benchmark task. Similar to the previous section,
Figure 7 illustrates the learning curves across different environment loop rates, with the same post-processing methods employed to enhance the readability of the curves.
As shown in
Figure 7, at lower environment loop frequencies (1 and 5 Hz), the learning performance was better than the 10 Hz baseline. This improvement can be attributed to the longer action cycle times, which produce larger joint position changes per step and thus more significant variability in the observations, aiding the learning process. However, performance begins to degrade as the environment loop rate increases beyond 10 Hz. The learning curves for the higher loop rates (20, 50, and 100 Hz) show increased mean episode lengths and lower mean rewards, indicating less efficient learning. This decline is due to the robot's inability to process commands and read joint states effectively at higher frequencies. Although control commands are sent at a higher rate, the robot's hardware control loop operates at 10 Hz, causing instability in command execution. Furthermore, as ROS controllers typically do not buffer control commands, the hardware control loop processes only the most recent command on the relevant ROS topic, leading to instability during training.
9.3. Empirical Evaluation of Asynchronous Scheduling of the Real-Time Environment Implementation Strategy
To empirically validate the real-time RL environment implementation strategy, the time taken for each action cycle and each actuator command cycle was logged across numerous episodes. The experiments were conducted using both the Rx200 and Ned2 Reacher tasks, with environment loop rates of 10 and 25 Hz, respectively. These rates correspond to actuator command periods of 100 ms and 40 ms, while the action cycle time was set to 800 ms, providing ample time to execute action repeats. The goal was to assess the effectiveness of the asynchronous scheduling approach in managing agent-environment interactions and to quantitatively measure the latency inherent in the system.
The boxplot in
Figure 8 depicts the distribution of cycle durations for the actuator and action cycles within the Rx200 Reacher task during training in a real-world environment. It shows a median action cycle time of 803.3 ms and a median actuator cycle of 100.11 ms, both closely approximating the preset threshold values for the benchmark task. Despite the presence of some outliers, the compact interquartile ranges in both plots indicate that the system performs with a high degree of consistency and negligible variability. However, the variations observed in the action cycle durations are partly due to the use of the TD3 implementation from the Stable Baselines3 (SB3) package as the learning algorithm. SB3 is a robust, general-purpose RL library, but it is not explicitly designed for robotic applications or real-time training scenarios. In particular, SB3 does not schedule policy updates asynchronously or immediately after sending an action; the agent instead waits for the policy update to complete before proceeding with the following action. This synchronous approach introduces the slight variations in action cycle times visible in the box plot.
One solution to this issue is to develop custom RL implementations that incorporate asynchronous policy updates [
55]. This approach allows the policy to be updated in the background while the agent continues to interact with the environment, thereby reducing latency and improving the efficiency of real-time learning. By scheduling policy updates asynchronously, these methods ensure that agent-environment interactions are not interrupted, maintaining the consistency and precision required for effective real-time learning. To evaluate the impact of this approach, additional experiments were conducted using a custom TD3 implementation in which policy updates were explicitly scheduled asynchronously relative to data collection. As illustrated in
Figure 9, the Rx200 Reacher task displayed a similar median action cycle time of 803.3 ms but with a narrower interquartile range than the standard SB3 implementation. This reduction in variability confirms that asynchronous scheduling effectively mitigates the timing disruptions introduced by synchronous policy updates, leading to improved temporal consistency during task execution.
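A minimal sketch of such an asynchronous update scheme is shown below, assuming an off-policy agent with a hypothetical `train_step()` method; this is not the custom TD3 implementation used in the experiments. Because deep-learning backends largely release the GIL during tensor operations, gradient updates in a background thread can overlap with data collection in the main loop.

```python
import threading

class AsyncTrainer:
    """Run gradient updates in the background so that action selection and
    environment interaction are never blocked by training (sketch only)."""

    def __init__(self, agent):
        self.agent = agent                     # assumed TD3-like off-policy agent
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._update_loop, daemon=True)

    def start(self):
        self._thread.start()

    def _update_loop(self):
        while not self._stop.is_set():
            self.agent.train_step()            # one gradient update per iteration

    def stop(self):
        self._stop.set()
        self._thread.join()
```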
Furthermore, to evaluate the proposed approach under a high computational load, the experiments were extended to support concurrent learning across both simulated and real environments. In this setup, all four environments (two simulated and two real environments, namely, the Rx200 Reacher and Ned2 Reacher) were trained simultaneously using the asynchronous policy update mechanism. The results, as depicted in
Figure 10, show that the distribution of the action and actuator cycle times across all four tasks remained consistent with those observed in the single-environment experiments. Each subplot within the composite boxplot illustrates minimal variation, with median values closely aligning with the predefined cycle thresholds, indicating that asynchronous scheduling sustains reliable timing even under multi-environment execution. These results support the proposed concurrent processing methodology, which minimizes overall system latency and facilitates real-time agent-environment interactions.
9.4. Discussion
All experiments in this study were conducted using a single robot in each RL environment. However, as discussed in the previous sections, the proposed UniROS framework enables the use of multiple robots in the same RL environment, particularly when they must collaborate to complete a task. If the robots used in the task are of the same make and model, the experimenters can use the robots' hardware control loop frequency as the environment loop rate. However, this setup introduces additional complexities when the robots have different hardware control loop frequencies. For instance, consider combining the two robots used in this study in a single RL environment, where the Rx200 robot has a hardware control loop frequency of 10 Hz and the Niryo NED2 operates at 25 Hz. In such a scenario, it is desirable to use the lower hardware control loop frequency (10 Hz in this case) as the environment loop rate for the entire system to ensure synchronized operation. This approach prevents the faster robot from being issued commands more frequently than the slower robot can keep up with, thereby maintaining consistent interaction with both robots. Furthermore, pairing a slower robot like the Rx200 (10 Hz) with an industrial-grade manipulator such as the UR5e, which has a hardware control loop frequency of 125 Hz, may not be ideal. The disparity in control loop rates can lead to inefficiencies and instability in the learning process: the slower robot becomes a bottleneck, hindering the performance of the faster robot and potentially disrupting overall task execution.
Additionally, when initializing the joint_state_controller that publishes the robot's joint state information, the publishing rate can be set to any frequency. The ROS controller then publishes at the specified rate, even if it differs from the hardware control loop frequency of the robot. While setting a higher frequency does not impact learning, as the ROS controller simply publishes the same joint state information multiple times, setting a lower frequency degrades performance because the agent does not receive the most up-to-date information from the robot. Therefore, the most straightforward solution is to set the publishing rate of the joint_state_controller to the hardware control loop frequency or higher, ensuring that the most recent hardware update is always available to the agent.
Similarly, if an external sensor in the task operates at a lower rate than the robot's hardware control loop frequency, this must be accounted for in the environment loop of the RL environment. This could mean using the latest available sensor data even if they are not updated at every loop iteration. In these scenarios, action repeats can be beneficial, especially with robots that have higher hardware control loop frequencies, as the RL agent then receives observations that display more substantial changes at each step rather than minimal changes that make learning harder.
10. Use Cases
Three possible use cases of UniROS are presented, each highlighting a unique aspect of its application. The first use case demonstrates the training of a robot directly in the real world, showcasing how to utilize the framework for learning without relying on simulation. The second use case demonstrates zero-shot policy transfer from simulation to the real world, highlighting the capability of the proposed framework to transfer learned policies from simulation to the real world. Finally, the last use case demonstrates the ability of the framework to learn policies applicable to both simulated and real-world environments. In these use cases, the environment loop rate was set to 10 Hz, and the action cycle time was set to 800 ms, as this configuration showed the best results for the Rx200 Reacher task in
Section 9.1. Similarly, an environment loop rate of 25 Hz and the same action cycle time of 800 ms were used for the Ned2 Reacher due to the multi-task learning setup in one of the use cases (
Section 10.3), ensuring that both robots received actions at the same temporal frequency. This consistency in action dispatching across tasks facilitates stable training and improves coordination when learning shared representations in multi-robot learning setups.
10.1. Training Robots Directly in the Real World
The first use case is demonstrated using the Rx200 Reacher task. Here, the physical robot was directly trained in the real world using TD3 (for a standard-type environment) and TD3 + HER (for a goal-conditioned environment). To evaluate the performance of learning, the success rate, mean total reward per episode, and number of steps (agent-environment interactions) taken by the robot to reach the goal position in an episode were plotted.
Figure 11 shows the learning metrics of the trained robot in each environment. Because the experiment using the standard-type environment with the stated action cycle duration of 800 ms and environment loop rate of 10 Hz has already been conducted and showcased in
Figure 6, the curves were replotted in
Figure 11a for readability and to ensure that the results are easily interpretable without the complexity of multiple curves in a single figure. Therefore, to provide a clear presentation of the learning curves for both environments,
Figure 11 was divided into two parts. Part (a) presents the learning metrics for the standard-type environment, and part (b) presents the learning metrics for the goal-conditioned environment. The learning curves of the next two use cases also follow the same convention.
Here, it was observed that both environments performed well on the Reacher task, as their success rates and mean rewards steadily increased while the average number of steps gradually decreased as learning progressed. Furthermore, the plateau of the success rate in both environments indicates that the robot learned a near-optimal policy for the given task. These results demonstrate that the proposed framework enables the direct training of robots in the real world using RL with a single experience stream.
Furthermore, it is essential to note that during the initial stage of training, challenges arose because some joints attempted to move beyond the restricted workspace. Therefore, the initial solution of simply restricting the end-effector pose to be within the workspace proved insufficient. Instead, the action (containing joint position values) was used to calculate the forward kinematics (FK) for all joints to check if any were trying to exceed the workspace limits. If any of the resulting joints were found to be out of bounds, the robot was prevented from executing the action. This additional step ensured the safety and stability of the robot during training. These findings highlight the feasibility and benefits of using the UniROS framework for direct real-world training of robotic systems, laying the groundwork for more complex future applications.
Additionally, it should be noted that the learning progress differs between standard-type and goal-conditioned environments, with the former achieving a near-optimal policy before 10K steps and the latter taking around 50K to 80K steps. A detailed explanation of this disparity is provided in
Appendix C, which discusses the differences in reward architectures and environment types, highlighting how dense rewards in standard-type environments facilitate quicker convergence compared to sparse rewards in goal-conditioned environments.
10.2. Simulation to Real-World
This section presents the experimental results obtained from training the Rx200 robot in simulation environments and subsequently transferring the learned policies to a physical robot. This experiment utilizes the simulated Rx200 Reacher task environments and employs the same RL algorithms to train the agents. These simulated environments were created following the proposed real-time implementation strategy, with the same environment loop rate and action cycle time as the previous real-world use case. Furthermore, to ensure seamless policy transfer from the simulation to the real world, the simulated environments were configured to mirror real-world conditions as closely as possible, which includes not pausing the simulation. The primary aspect of this process was to mimic the hardware control loop of the real robot within the Gazebo simulation, ensuring that the timing of the environment loop in the simulated environment closely matched that of the real-world counterpart.
Figure 12 shows the constructed real-world environment in the Gazebo simulator.
Additionally, the simulated robot’s URDF file (robot description) contains the
gazebo_ros_control (
https://classic.gazebosim.org/tutorials?tut=ros_control, accessed on 17 June 2025) plugin, which loads the appropriate hardware interfaces and controller manager so that the simulator can replicate the hardware control loop frequency of the real robot. This plugin configuration ensures that actuator commands are processed at the desired control frequency, matching the real-world RL environment. However, manually setting this parameter in the URDF can sometimes cause the robot to exhibit unexpected behaviors in simulation. One workaround in such scenarios is to use the hardware control loop frequency of the real robot as the environment loop rate, forcing the simulated robot to receive control commands and operate at the actual hardware rate. With these configurations, learning in the simulated environment closely resembles learning in the real world.
As illustrated in
Figure 13, the learning curves were plotted to evaluate the learning performance in both types of simulated environments. Similar to the previous use case, both agents learned nearly an optimal policy, as the success rate plateaued for the Reacher task. Furthermore, the gradual increase in the success rate and mean reward, along with the decrease in the mean episode length, indicates that the learning was stable and consistently improved throughout the training. Once the trained policies were obtained from the simulation environments, a zero-shot transfer was performed directly for the physical Rx200 robot (
Figure 12) without employing any sim-to-real techniques or domain adaptation methods to bridge the reality gap. The primary objective of zero-shot transfer was to evaluate the ability of the proposed framework to generalize its learning from ROS-based simulation environments to the real world without requiring any additional training.
To evaluate policy transfer, 200 episodes of the Reach task were conducted to record the success or failure of each episode. This provides valuable insights into the performance of the transferred policies in physical robots. In this study, the trained TD3 model in the standard-type environment and the TD3 + HER model trained in the goal-conditioned environment achieved a nearly 100% success rate in the real world. This result indicates that the transferred policies can be generalized and accurately guide the physical robot without requiring additional fine-tuning.
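A sketch of this evaluation loop is shown below for an SB3 policy and a Gym-style environment. The `is_success` info key follows the Gym robotics convention and is an assumption here; the actual success criterion of the benchmark tasks is described in Appendix A.

```python
def evaluate_policy(model, env, n_episodes=200):
    """Roll out a trained policy on the physical robot environment and
    report the success rate (mirrors the 200-episode evaluation above)."""
    successes = 0
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
        successes += int(info.get("is_success", False))
    return successes / n_episodes
```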
However, it should be noted that although the relatively simple Rx200 Reacher task achieved a successful zero-shot policy transfer without any additional fine-tuning, this will not necessarily hold for all tasks. This is especially true for complex tasks involving additional sensors (such as cameras and lidars), which may require domain randomization techniques, for example sampling data from multiple environments, each with different seeds and Gazebo physics parameters (which can be tuned using UniROS). In such cases, zero-shot transfer may not be feasible, and fine-tuning policies in the real world may be necessary. Nevertheless, initial training in a simulated environment can still provide a good starting point for further optimization in the real world.
10.3. Concurrent Training in Real and Simulation Environments
This use case comprises two experiments. The first experiment demonstrates how real-world dynamics and kinematics can be learned concurrently using less expensive simulation environments to expedite the learning process. The second showcases one of the multi-robot/task learning approaches using the proposed framework and environment implementation strategy.
10.3.1. Learning a Generalized Policy
This experiment aims to demonstrate the capability of the proposed framework for training a generalized policy that can perform well in both domains by leveraging knowledge from simulations and real-world data using concurrent environments. This experiment was designed using the created Rx200 Reacher real and simulated environments to learn the same Reach task. Furthermore, to be consistent with the previous use cases, one generalized policy was trained for standard-type environments (simulated and real) with dense rewards, and another for goal-conditioned environments with sparse rewards. Here, the proposed framework’s ability to execute concurrent environments was exploited to enable synchronized real-time learning in both simulation and real-world environments.
In this experiment, an iterative training approach was employed to update a single policy by sampling trajectories from both real and simulated environments, enabling the agent to integrate real-world dynamics and kinematics into the learning process. Similar to the previous use cases, a TD3-based learning strategy was applied to standard-type environments and a TD3 + HER strategy to goal-conditioned environments. The implemented learning strategy is presented in Algorithm 1. The performance of the learning process was evaluated by plotting the learning curves, as shown in
Figure 14. As observed in the previous use cases, the gradual decrease in the mean episode length and increase in the rewards and success rate imply that learning was stable and consistent throughout training. Additionally, both agents learned nearly optimal policies, as indicated by the plateauing of the success rate. Furthermore, similar to the previous use case, deploying the trained agent in the respective domain yielded 100% accuracy in both types of environments.
Algorithm 1. Multi-task training strategy for TD3/TD3 + HER.
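Algorithm 1 is presented as a figure in the original article; the sketch below is only a schematic reconstruction of the round-robin sampling loop described in the surrounding text, with hypothetical method names (`select_action`, `train_step`, `replay_buffer.add`) and illustrative step counts, not the authors' exact procedure.

```python
def concurrent_training(agent, environments, steps_per_env=100, total_rounds=300):
    """Iteratively collect experience from every environment (real and simulated)
    into a shared replay buffer and update a single policy (schematic only)."""
    observations = [env.reset() for env in environments]
    for _ in range(total_rounds):
        for i, env in enumerate(environments):
            obs = observations[i]
            for _ in range(steps_per_env):
                action = agent.select_action(obs)            # hypothetical agent API
                next_obs, reward, done, info = env.step(action)
                agent.replay_buffer.add(obs, action, reward, next_obs, done)
                agent.train_step()                            # TD3 / TD3 + HER update
                obs = env.reset() if done else next_obs
            observations[i] = obs
    return agent
```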
While the results illustrated in
Figure 14 appear less impressive than those of the previous use cases, the main advantage of this use case lies in demonstrating the capability to train concurrently in both real and simulated domains with a single policy. The relatively simple nature of the Reach task does not fully reflect the potential benefits of this concurrent learning approach, because the gap between simulation and real-world performance is minimal. However, the true strength of this strategy becomes more evident when it is extended to more complex robotic tasks, such as manipulation involving deformable objects, tool use, or contact-rich interactions, where discrepancies between simulated and real-world physics become more noticeable. In such scenarios, learning from real-world data is crucial for bridging the reality gap, while simulations continue to provide large-scale data for rapid iteration and policy refinement.
10.3.2. Multi-Task Learning
To demonstrate one of the multi-robot/task learning approaches, the previous experiment was extended by incorporating all the created task environments of Rx200 Reacher and Ned2 Reacher into the environments array of Algorithm 1. Then, by sampling experience from multiple environments, the agent is exposed to multiple tasks across different domains (robot types and physical/simulated instances), enabling it to learn the optimal behavior under varying conditions. Furthermore, to accommodate the differences in the observation and action spaces between the Rx200 and Ned2 Reacher environments, the agent architecture was configured using the largest observation space among the environments (from the Rx200 Reacher). For environments with smaller observation spaces, zero-padding was applied to match the input dimensions (for the Ned2 Reacher). Similarly, the largest action space across the environments (from the Ned2 Reacher) was used as the action dimension for the agent, and any additional unused actions were ignored when executing actions in environments with smaller action spaces (for the Rx200 Reacher). This design choice ensured compatibility across heterogeneous environments while maintaining the shared policy architecture.
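The padding and truncation described above can be captured in two small helpers, sketched here with illustrative names and dimensions:

```python
import numpy as np

def pad_observation(obs, target_dim):
    """Zero-pad a smaller observation vector (e.g., Ned2 Reacher) so it matches
    the largest observation space among the tasks (Rx200 Reacher)."""
    obs = np.asarray(obs, dtype=np.float32)
    padded = np.zeros(target_dim, dtype=np.float32)
    padded[: obs.shape[0]] = obs
    return padded

def trim_action(action, env_action_dim):
    """Drop the unused trailing action dimensions when executing the shared
    policy in an environment with a smaller action space (e.g., the 5-DOF Rx200)."""
    return np.asarray(action, dtype=np.float32)[:env_action_dim]
```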
Although more sophisticated techniques, such as using task-conditioned policies with task embeddings [
56] or adaptation layers [
57], could help manage heterogeneous input/output structures more elegantly, a simpler approach was deliberately chosen to keep the training pipeline minimal. This approach of padding and unifying action and observation spaces has also been used in prior multi-task reinforcement learning research [
58] and serves as a reasonable baseline when dealing with a limited number of tasks and known environment interfaces.
Similar to the previous use cases, the training curves for this experiment were plotted, as shown in
Figure 15. As before, it was observed that the training remained stable across both types of environments, as indicated by the gradual decrease in the mean episode length, and the agents successfully learned policies that converged toward optimal behavior with the plateauing of the success rate and the mean episode reward. The advantage of this approach lies in the agent’s ability to learn a single policy that generalizes well across multiple tasks and domain configurations, ultimately saving time and computational resources by avoiding separate training for each environment.
While the deployment of the trained agent in each environment showed similar 100% accuracy as with the previous use cases, it is important to note that these results are specific to simple Reach tasks, which have low-dimensional state and action spaces and minimal domain discrepancy. As more complex tasks are introduced, techniques like task embeddings, modular policies [
59], or adaptation layers may become necessary to manage the increased diversity in task structure and dynamics. Similarly, meta-learning algorithms such as Model Agnostic Meta-learning (MAML) [
60] may enable learning a model that can quickly adapt to new tasks with minimal data when the tasks are complex.
In summary, these experiments demonstrate the capability of the UniROS framework for training policies that perform well in both simulated and real-world environments. The ability to leverage both domains during training in concurrent environments without incurring synchronization bottlenecks provides a robust solution for developing versatile and adaptive robotic systems. The primary goal of these demonstrations is not to determine the best method for setting up learning tasks in the real world. Instead, it showcases the versatility and robustness of the UniROS framework, providing a platform for researchers to extend it further and adapt it to their specific research needs. Therefore, these use cases demonstrate the current capabilities of the framework and lay the groundwork for future research in more complex areas, such as robot-based multi-agent systems, multi-task learning, and meta-learning.
11. ROS1 vs. ROS2 Support in the UniROS Framework
This study is primarily based on the first major version of the Robot Operating System (ROS 1), which reaches its end of life (EOL) in 2025 as ROS 2 emerges as the new standard. Hence, plans to upgrade the package to support ROS 2 are currently underway, starting with ROS 2 Humble (
https://docs.ros.org/en/humble/index.html, accessed on 17 June 2025) and Gazebo Classic (Gazebo 11). However, it is essential not to abandon ROS 1, as many robots in the ROS-Industrial repository and other ROS sensor packages have not yet been upgraded to ROS 2 or remain unstable under it. Similarly, only the latest robots are configured to work with ROS 2 out of the box, while some older generations are discontinued, no longer maintained by the manufacturer, or abandoned because the company has closed down (such as the Baxter robot). Therefore, experimenters may need to wait for the ROS community to upgrade these packages to ROS 2.
Moreover, in terms of the real-time learning strategy proposed in this study, it is worth noting that the design choices made, such as ROS timers, ROS publishers and subscribers, and action repeats, are inherently compatible with both ROS 1 and ROS 2. These components are part of the core middleware tools that remain consistent across both distributions. Consequently, the implementation strategy outlined in this work can be readily adapted to ROS2 with minimal modification, preserving its real-time characteristics and concurrent environment support.
12. Conclusions & Future Work
This study introduced an ROS-based unified framework designed to bridge the gap between simulated and real-world environments in robot-based reinforcement learning. The dual-package approach of the framework, which contains MultiROS for simulations and RealROS for the real world, facilitates learning across simulated and real-world scenarios by using an ROS-centric environment implementation strategy. This implementation strategy facilitates real-time decision-making by leveraging the built-in multi-threading capabilities of ROS for subscribers and timers, enabling the asynchronous and concurrent execution of operations essential for efficient RL workflows. By employing this approach and controlling robots directly through ROS’s low-level control interface, this study has effectively addressed the challenges of ROS-based real-time learning. This was demonstrated using a benchmark learning task, highlighting the low latency and dynamic interaction between the agent and environment.
Furthermore, the OpenAI Gym library has recently been deprecated and replaced by the Gymnasium (
https://gymnasium.farama.org/, accessed on 17 June 2025) library [
61]. Therefore, the UniROS framework has been upgraded to support Gymnasium, including the ROS-based support package for the Gymnasium-based Stable Baselines3 versions. For simplicity, only OpenAI Gym is referenced in the above sections.
Additionally, it is worth noting that all experiments focused on learning a simple Reach task; training on multiple complex tasks that incorporate external sensors, such as vision sensors, was not explored in this study, nor were robustness strategies such as noise injection or uncertainty-aware learning for improving reliability under real-world disturbances. Therefore, one potential area for future work is to assess the framework's capabilities in complex multi-task and meta-learning scenarios across various benchmark task environments. In particular, this would include integrating advanced simulation-to-real transfer techniques, such as domain randomization, dynamic parameter adjustment, and variability-aware training, as these are vital for bridging significant domain gaps in tasks involving more complex perception or contact dynamics. Additionally, we acknowledge that the results reported in this study are based on a single training run. Future work will include an evaluation across multiple random seeds to assess training robustness, variance, and statistical confidence, further quantifying the reliability and generalization capabilities of policies trained within the UniROS framework. Furthermore, future work could investigate how different reward-shaping strategies affect learning performance, enabling better task-specific tuning.
Furthermore, it is essential to acknowledge that the robot was connected to the PC via a wired connection during the experiments, ensuring a stable and reliable communication interface. This was beneficial for conducting experiments in a controlled manner, but it may not accurately reflect the challenges that experimenters may encounter with wireless or less stable connections. Therefore, another area for future work could be to explore the implications of varying connection types on the performance and reliability of the framework. Furthermore, the robot used in this experiment is configured with effort controllers that accept Joint Trajectory messages. However, some robots, such as UR5e, can be configured to work with effort, position, or velocity controllers, each of which has its own advantages and disadvantages, depending on the task requirements and specific capabilities of the robot. Therefore, another area for future work could be to investigate the impact of using different types of controllers to learn the same benchmark tasks. This provides valuable insights into the effects of controller differences on learning performance.
Moreover, this study primarily discussed CPU utilization during learning and interaction with the environment. However, given that many deep RL architectures leverage GPU acceleration, future experiments should explore CPU-GPU co-utilization metrics. Understanding how GPU-based training and inference affect the timing and synchronization of agent-environment loops can further improve the real-time applicability of the framework. Similarly, future work could involve exploring other reinforcement learning algorithms and off-the-shelf RL library frameworks that are better suited for real-time control, thereby further enhancing the performance and reliability of the proposed solution.
While this study focused on manipulation tasks, the proposed framework is not limited to stationary arms. The same ROS-based abstraction can be extended to mobile robots, drones, and other robotic systems or fleets that support ROS interfaces. Therefore, future use cases could include mobile navigation, exploration, and hybrid scenarios involving both mobility and manipulations. Examining the impact of such configurations on learning efficiency, task performance, and system stability would provide valuable insights for designing robust ROS-based multi-robot RL systems.
In conclusion, the proposed ROS-based RL framework addresses the challenges of bridging simulated and real-world RL environments. Its modular design, support for concurrent environments, Python bindings for ROS, and real-time RL environment implementation strategy collectively enhance the efficiency, flexibility, reliability, and scalability of robotic reinforcement learning tasks. The experiments performed with the benchmark task further illustrate the practical applicability of the framework in real-world robotics, showcasing its potential for advancing the field of reinforcement learning. Therefore, we encourage the community to build upon this foundation by exploring more intricate tasks and environments and pushing the boundaries of what is achievable in robotic reinforcement learning.