Article

Teleoperation of Dual-Arm Manipulators via VR Interfaces: A Framework Integrating Simulation and Real-World Control

RoboLab, Robotics and Artificial Vision, University of Extremadura, 10003 Cáceres, Spain
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 572; https://doi.org/10.3390/electronics15030572
Submission received: 31 December 2025 / Revised: 22 January 2026 / Accepted: 27 January 2026 / Published: 28 January 2026

Abstract

We present a virtual reality (VR) framework for controlling dual-arm robotic manipulators through immersive interfaces, integrating both simulated and real-world platforms. The system combines the Webots robotics simulator with Unreal Engine 5.6.1 to provide real-time visualization and interaction, enabling users to manipulate each arm’s tool point via VR controllers with natural depth perception and motion tracking. The same control interface is seamlessly extended to a physical dual-arm robot, enabling teleoperation within the same VR environment. Our architecture supports real-time bidirectional communication between the VR layer and both the simulator and hardware, enabling responsive control and feedback. We describe the system design and performance evaluation in both domains, demonstrating the viability of immersive VR as a unified interface for simulation and physical robot control.

1. Introduction

Teleoperation systems enable humans to control robots remotely, providing essential capabilities in environments that are hazardous, inaccessible, or otherwise unsafe for direct human intervention. This paradigm has become a key enabler in domains such as disaster response [1,2], radioactive material handling [3], underwater exploration [4,5], and medical robotics [6]. These systems enhance human safety and operational efficiency by allowing operators to perform precise manipulations from a distance.
However, conventional teleoperation interfaces often suffer from limited depth perception, restricted situational awareness, and non-intuitive control schemes. These limitations can lead to reduced task performance and operator fatigue [7], particularly when dealing with complex robotic systems, such as dual-arm manipulators. While some recent approaches have explored solutions using miniature replicas of the robot [8] or body motion tracking for teleoperation [9], these methods typically provide limited immersion and lack comprehensive 3D spatial understanding of the remote workspace when the manipulation occurs in a different room or environment. In contrast, immersive technologies, particularly Virtual Reality (VR), can significantly improve spatial understanding and user engagement in teleoperation scenarios by enabling natural 3D interaction, head-tracked viewpoints, and controller-based motion tracking [10].
Despite these advances, integrating VR with real-world robotic systems remains challenging. Many existing frameworks are restricted to simulation environments, lack real-time feedback, or require specialized hardware setups, hindering reproducibility and scalability [11]. Furthermore, synchronization between virtual and physical robots demands robust communication architectures to ensure low latency and reliable control. Although Unity-based solutions are commonly used for such integrations, recent studies have shown that Unreal Engine 5 (UE) provides superior rendering performance, lower latency, and more stable data synchronization for robotics and teleoperation applications [12,13].
In this work, we present a unified VR framework (Framework repository available at https://github.com/alfiTH/VR_teleoperation (accessed on 22 January 2026)) for teleoperating dual-arm robotic manipulators that bridges simulation and physical implementation. Our system combines the Webots robotics simulator with Unreal Engine 5.6.1 to deliver a highly realistic, responsive, and immersive interface. Users can manipulate each arm’s end-effector using VR controllers, seamlessly transitioning between simulated and real robotic platforms without altering the control paradigm. This unified design enables efficient testing, training, and operation within a single VR environment.
Our approach further emphasizes modularity and real-time perception: a middleware abstraction layer decouples the VR application from the underlying robotics framework, and the operator is immersed in a digital twin augmented with a large, colorized point cloud reconstruction of the remote environment, rendered efficiently on the GPU. We validate the proposal through technical performance measurements and a user study with bimanual manipulation tasks, demonstrating the feasibility of immersive VR as a unified interface for simulation and physical robot control.
Figure 1 provides an at-a-glance overview of the proposed immersive teleoperation workflow. An operator, equipped with a consumer VR headset and tracked controllers, commands the end-effectors of a dual-arm system through natural bimanual interaction while immersed in a synchronized digital twin of the remote scene. Crucially, the operator perspective is augmented with a dense 3D reconstruction (colored point cloud), which supports depth perception and collision-aware manipulation during contact-rich tasks such as cloth handling, and exemplifies the real-time feedback our framework provides across both simulation-based development and real robot deployment.
The principal contributions of this paper are as follows:
  • A unified VR teleoperation architecture that integrates Webots (simulation) and a physical dual-arm robot through the same Unreal Engine 5.6.1 immersive interface.
  • A modular communication design based on a middleware abstraction (static library) that enables replacing the robotics middleware (e.g., RoboComp/Ice, ROS2) without modifying the Unreal Engine application.
  • A real-time digital-twin workflow for dual-arm end-effector control using VR motion controllers, including continuous bidirectional synchronization and a safety-oriented dead-man switch mechanism.
  • A scalable 3D perception visualization pipeline that fuses multi-sensor point clouds and renders >1 M colored points in VR via GPU-based Niagara, preserving interactive performance.
  • An experimental validation including (i) technical performance benchmarking of point-cloud rendering and (ii) a user study on six bimanual manipulation tasks reporting success and collision metrics.
Beyond its immediate applications in robotic research and industrial automation, our system opens the door to future extensions, such as learning-based teleoperation, guided manipulation via mixed-reality feedback, and multi-operator collaboration. The proposed architecture thus advances VR-based teleoperation as a flexible, unified platform that integrates simulation, real-time perception, and physical robot control within a single immersive interface.
The remainder of this paper is organized as follows: Section 2 reviews recent work on XR-based teleoperation, digital twins, and 3D scene visualization for manipulation. Section 3 presents the proposed system architecture and implementation, including the sim-to-real workflow, middleware abstraction, control loop, and dense point-cloud rendering pipeline. Section 4 reports the technical benchmarks and the user-study results obtained with bimanual manipulation tasks. Finally, Section 5 discusses the main findings, limitations, and directions for future work.

2. Related Works

Immersive and extended-reality (XR) interfaces have emerged as a prominent approach to address long-standing limitations of classical teleoperation (e.g., restricted viewpoint control, weak depth cues, and high cognitive load). A systematic review by Wonsick and Padir highlights the rapid growth of VR interfaces for robot operation enabled by consumer-grade headsets and controllers and discusses recurring design axes such as viewpoint management, control mapping, and scene representation fidelity [14].
Several works since 2020 have investigated VR as a more natural interface for dexterous manipulation. Hetrick et al. compared two VR control paradigms (positional/waypoint-like vs. trajectory/click-and-drag) for teleoperating a Baxter robot and implemented a full VR environment that includes a live point-cloud reconstruction from a depth sensor alongside wrist-camera streams [15]. De Pace et al. evaluated an “Enhanced Virtual Reality” teleoperation approach in which both the robot and the environment are captured with RGB-D sensors, while the remote operator commands the robot’s motion through VR controllers [16]. In industrial contexts, Rosen and Jha report that direct VR arm teleoperation becomes more difficult when low-level torque/velocity interfaces are not exposed, and motion is mediated by proprietary position controllers; they propose filtering VR command signals and demonstrate contact-rich manipulation on industrial manipulators [17]. These studies strongly support VR’s usability benefits but typically focus on a specific robot/control stack and do not target a unified pipeline that seamlessly spans simulation and physical execution through a single immersive interface.
A complementary line of work augments the operator’s perception through digital twins and reconstructed 3D scene cues. GraspLook proposes a VR telemanipulation system that replaces pure camera-based feedback with an augmented virtual environment, using a region-based convolutional neural network (R-CNN) to detect relevant instruments and render their digital twins; user results indicate reduced mental demand and faster execution in tube manipulation [18]. IMMERTWIN introduces a mixed-reality framework built around a closed-loop digital twin, explicitly streaming a live colored point cloud (from multiple ZED cameras) into Unreal Engine 5.4, and reports a 26-participant evaluation across two robot platforms [19]. For bimanual/dual-manipulator settings, García et al. propose an AR interface (HoloLens) combined with a gamepad to teleoperate bimanual industrial manipulators, aiming to reduce learning time and improve ergonomics relative to classic joystick-based interfaces [20]. Gallipoli et al. propose a VR-based dual-mode teleoperation architecture with an explicit mode switch (“Approach” vs. “Telemanipulation”) to support safe and flexible remote manipulation [21]. In parallel, configurable immersive baselines for online 3D reconstruction, such as VRTAB-Map (built on RTAB-Map SLAM), further emphasize the importance—and practical difficulty—of presenting evolving dense 3D reconstructions to operators during teleoperation missions [22]. Despite this progress, scaling dense 3D reconstruction to high point counts while preserving interactive frame rates in VR, and doing so in a framework that remains portable across simulation and real robots, remains an open engineering challenge.
Recent toolkits also emphasize teleoperation as a means to collect high-quality demonstrations for learning-based robotics. OpenVR (SoftwareX 2025) provides an open-source VR teleoperation method using an Oculus headset to control a Franka Panda, explicitly motivated by the cost and difficulty of collecting demonstrations [11]. GELLO proposes a low-cost, kinematically equivalent physical replica controller (3D-printed + off-the-shelf motors) and reports comparisons against common low-cost alternatives such as VR controllers and 3D mice; it also demonstrates complex bimanual/contact-rich tasks [8]. Tung et al. present a VR teleoperation system aimed at collaborative dataset collection, emphasizing immersive stereoscopic egocentric feedback and a taxonomy of human–robot collaborative tasks [23]. While these efforts are highly relevant for data generation, they do not primarily address a unified sim/real immersive digital twin workflow for dual-arm manipulation with high-density, reconstructed 3D scene feedback.
In contrast to prior work, our proposal targets a single immersive interface that (i) unifies simulation and real-robot execution without changing the operator’s interaction paradigm, (ii) emphasizes modular middleware decoupling of the XR front-end from the robotics back-end, and (iii) delivers scalable, GPU-friendly rendering of dense colored point clouds within a synchronized dual-arm digital twin, validated via technical benchmarks and bimanual user tasks.
To facilitate a compact yet readable comparison, we summarize the representative works discussed above in two complementary tables. Table 1 reports the main descriptive attributes of each approach (publication year, XR modality, validation on real hardware and/or simulation, and the type of 3D scene representation). Table 2 preserves the full qualitative commentary by pairing each work with its key aspects and its relation to our proposal. Together, these tables highlight where prior systems focus (e.g., interface design, scene augmentation, or demonstration collection) and clarify the gap our unified sim-to-real dual-arm framework addresses through middleware decoupling and scalable dense point-cloud rendering.

3. System Architecture and Implementation

3.1. System Overview

Figure 2 summarizes the proposed VR teleoperation framework and the data flow through four main layers: the XR interaction layer, the communication layer, the actuator/sensor layer, and the robot/simulator layer.
  • XR interaction Layer: The operator issues bimanual end-effector commands using a Meta Quest 3 headset and tracked controllers, while Unreal Engine renders the immersive interface. This interface includes the robot’s digital twin, the operator’s phantom grippers, and a dense colored point cloud rendered in real time via Niagara, providing the user with an intuitive control mechanism.
  • Communication Layer: Unreal Engine exchanges the target end-effector poses and feedback, such as robot state information, with a dedicated C++ abstraction layer, RobotMiddleware. This middleware bridges the XR front-end and the robotics back-end, such as RoboComp/Ice, ensuring seamless data transfer between the user interface and the robot.
  • Actuator/Sensor Layer: This layer is composed of two main components: the ArmController, which computes joint velocity commands via inverse kinematics (IK) with a safety mechanism (dead-man switch) for the end-effector, and the multi-sensor perception system, which fuses data into a dense colored point cloud.
  • Robot/Simulator Layer: The final layer consists of the physical robot P3bot or simulator Webots. It receives the commands from the Actuator/Sensor layer and executes them, performing the desired manipulation tasks.

3.2. Robotic Platform and Digital Representations

We implemented our system on a custom dual-arm mobile robot, P3bot (Figure 3a), equipped with two Kinova Gen3 7-DoF manipulators mounted on a mecanum-wheeled base. The sensory suite includes a Robosense Helios LiDAR, a ZED 2i stereo camera, and a Ricoh Theta 360° camera. All onboard computation runs under Ubuntu 24.04 with an Intel Core Ultra 9 Processor 185H and NVIDIA GeForce RTX 4070 Laptop GPU.
To support a unified workflow across simulation and real deployment, we created 3D assets of the robot for both Unreal Engine (Figure 3b) and the Webots simulator (Figure 3c). This alignment between the physical platform and its virtual counterparts enables the operator to interact through the same VR interface while maintaining coherent kinematics and visualization in both simulation and real-world teleoperation.

3.3. XR Runtime and Interaction in Unreal Engine

For immersive teleoperation, a Meta Quest 3 headset is used for Virtual Reality (VR) interaction. The headset connects to the main system via ALVR and SteamVR, which expose an API compatible with Unreal Engine 5.6.1.

3.4. Communication Architecture and Middleware Abstraction

Figure 4 details the software/hardware stack and the main data flows of the proposed teleoperation system. The overall architecture is built upon the RoboComp robotics framework, which internally employs the Ice middleware; however, Unreal Engine 5.6.1 is kept independent of these back-end choices by communicating exclusively through RobotMiddleware, a static C++ library that exposes a standardized API for command and sensing exchange. This abstraction layer can be re-targeted to alternative middleware (e.g., ROS 2) without modifying the UE application. UE acts as the XR front-end, receiving head/hand tracking and controller events through the SteamVR runtime, while the Meta Quest 3 is connected via ALVR.
On the control side, an ArmController module consumes target end-effector commands from UE and produces joint-level commands that are delivered to the robot through dedicated drivers; in particular, one KinovaController instance is used per manipulator to interface the two Kinova Gen3 7-DoF arms. On the perception side, sensor drivers provide the raw data streams used by the VR visualization pipeline: Lidar3D interfaces with the RS Helios LiDAR, RicoComponent interfaces with the Ricoh Theta camera, and ZEDComponent interfaces with the ZED 2i stereo camera. These sensing streams are queried by higher-level modules (e.g., the RGBD360 block) to obtain LiDAR scans and panoramic imagery (“Get LiDAR”/“Get Image 360”), enabling the colored point-cloud reconstruction that is rendered in UE.
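The middleware decoupling described above can be illustrated with a minimal interface sketch. Python is used here for readability; the actual RobotMiddleware is a static C++ library, and all type and method names below are illustrative assumptions rather than the framework’s published API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Pose:
    position: tuple      # (x, y, z) in metres, robot base frame
    orientation: tuple   # quaternion (x, y, z, w)


class RobotMiddleware(ABC):
    """Back-end-agnostic API consumed by the XR front-end.

    The UE application talks only to this interface, so swapping
    RoboComp/Ice for ROS 2 means providing another subclass.
    """

    @abstractmethod
    def send_target_pose(self, arm: str, pose: Pose) -> None: ...

    @abstractmethod
    def get_joint_states(self, arm: str) -> list: ...

    @abstractmethod
    def get_point_cloud(self): ...


class IceBackend(RobotMiddleware):
    """Would wrap RoboComp/Ice proxies; placeholder bodies here."""

    def send_target_pose(self, arm, pose):
        pass  # forward to the ArmController proxy

    def get_joint_states(self, arm):
        return []  # query the KinovaController proxy

    def get_point_cloud(self):
        return None  # query the RGBD360 block
```

A ROS 2 back-end would subclass the same interface, leaving the UE-facing code untouched, which is the substitution property the static-library design aims for.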

3.5. Control Architecture, Kinematics, and Safety

The motion control of the dual-arm system is implemented using a Resolved-Rate Motion Control (RRMC) scheme that leverages the optimization-based solvers provided by the Robotics Toolbox [24]. This approach allows the system to treat teleoperation not merely as a pose-mapping task, but as a multi-constraint optimization problem.

3.5.1. Kinematic Solver and Constraint Handling

The ArmController module receives target end-effector poses ($T_{target}$) from Unreal Engine at a rate of 90 Hz through the RobotMiddleware static library. Instead of a conventional analytical inverse kinematics solution, we employ a non-linear damped least-squares (Levenberg-Marquardt) numerical solver. This choice is critical for dual-arm manipulation as it ensures robust behavior near kinematic singularities and allows for the integration of multiple operational constraints:
  • Joint Limit Avoidance: The solver penalizes joint configurations approaching physical limits.
  • Self-Collision Proximity: The framework monitors the distance between the two manipulators and between each arm and the robot’s chassis.
  • Singularity Management: The damping factor in the Levenberg-Marquardt algorithm prevents high joint velocities when the arms are near singular configurations.
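The damped least-squares velocity step underlying this scheme can be sketched in plain NumPy. This is a simplified illustration, not the system’s actual solver (which uses the Robotics Toolbox and additional constraint terms); the function name, damping value, and uniform velocity-limit scaling are assumptions for the example:

```python
import numpy as np


def dls_velocity(J, v_desired, damping=0.05, qd_max=1.0):
    """One resolved-rate step via damped least squares.

    J         : 6xN end-effector Jacobian.
    v_desired : 6-vector spatial velocity command.
    damping   : Levenberg-Marquardt-style damping; bounds joint
                velocities near singular configurations.
    qd_max    : per-joint velocity limit (rad/s), enforced by
                uniform scaling to preserve the motion direction.
    """
    n_task = J.shape[0]
    # qd = J^T (J J^T + lambda^2 I)^-1 v  (damped pseudoinverse)
    qd = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(n_task),
                               v_desired)
    peak = np.max(np.abs(qd))
    if peak > qd_max:
        qd *= qd_max / peak
    return qd
```

Near a singularity, $J J^T$ becomes ill-conditioned, and the $\lambda^2 I$ term keeps the solve well-posed at the cost of a small tracking error, which is the trade-off the damping factor manages.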

3.5.2. Haptic and Visual Feedback Loop

To bridge the gap between the user’s intent and the robot’s physical limits, the system implements a feedback mechanism. If the solver cannot find a valid solution within the admissible workspace (due to an impending collision or singularity), the motion is progressively dampened or halted. In such cases, the RobotMiddleware triggers a haptic vibration in the VR controllers to notify the operator that the commanded pose is unreachable. To resume motion, the operator must steer the controller toward a collision-free direction, allowing the solver to re-converge.

3.5.3. Safety State-Machine and Dead-Man Switch

The safety logic is governed by a Finite State Machine (FSM) comprising three primary states to ensure robust operation under both nominal and fault conditions (see Figure 5):
  • IDLE State: This is the default standby mode. The robot maintains zero joint velocity and active brakes. The system remains in this state until a dual-concurrence condition is met: a stable heartbeat signal from the RobotMiddleware and the dead-man switch engaged.
  • ACTIVE State: Once the dead-man switch is pressed, the FSM transitions to this state, enabling the real-time stream of velocity commands ($\dot{q}$) derived from the IK solver. If the operator releases the dead-man switch, the system transitions directly back to IDLE for an immediate but nominal halt.
  • SAFE-STOP State: This state acts as a dedicated fault-handling mechanism. It is triggered automatically from the ACTIVE state if a communication timeout (>100 ms) or a connection loss is detected. In this mode, the system executes an emergency deceleration profile to neutralize inertia before transitioning back to IDLE once a stable connection is re-established or the system is reset.
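The three-state logic above can be sketched as a compact state machine. This is a minimal model of the described behavior (the 100 ms timeout comes from the text; class and method names are illustrative, and the emergency deceleration profile is abstracted away):

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # brakes active, zero joint velocity
    ACTIVE = auto()     # streaming IK velocity commands
    SAFE_STOP = auto()  # fault handling: decelerate, then re-arm

HEARTBEAT_TIMEOUT = 0.1  # 100 ms, per the communication timeout above


class SafetyFSM:
    def __init__(self):
        self.state = State.IDLE
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Called whenever the RobotMiddleware link is confirmed alive."""
        self.last_heartbeat = now

    def step(self, now, dead_man_pressed):
        alive = (self.last_heartbeat is not None
                 and now - self.last_heartbeat <= HEARTBEAT_TIMEOUT)
        if self.state == State.IDLE:
            # Dual concurrence: stable heartbeat AND dead-man engaged.
            if alive and dead_man_pressed:
                self.state = State.ACTIVE
        elif self.state == State.ACTIVE:
            if not alive:
                self.state = State.SAFE_STOP  # timeout / connection loss
            elif not dead_man_pressed:
                self.state = State.IDLE       # nominal halt
        elif self.state == State.SAFE_STOP:
            if alive:  # deceleration assumed complete, link restored
                self.state = State.IDLE
        return self.state
```

Note that releasing the dead-man switch and losing the link take different paths: the former is a nominal stop straight to IDLE, while the latter routes through SAFE-STOP so that inertia is neutralized before the system can be re-armed.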

3.6. Perception Pipeline and Point-Cloud Rendering

Inside Unreal Engine, the remote workspace is represented as a large colored point cloud, enabling free-viewpoint exploration without tethering the operator’s perspective to the robot’s physical sensor position. The environment point cloud is obtained by fusing RS Helios LiDAR data with color information from the Ricoh camera and subsequently merging it with the point cloud provided by the ZED 2i camera. The resulting colored point cloud is rendered directly on the GPU through Unreal Engine’s Niagara system, enabling real-time visualization of more than 1,000,000 points while preserving interactive performance.
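The LiDAR-camera fusion step can be sketched as projecting each LiDAR point into the 360° equirectangular panorama to look up its color. This is a simplified model of the colorization stage only (it omits the ZED 2i merge and any occlusion handling), and the `T_cam_from_lidar` extrinsic is an assumed calibration input:

```python
import numpy as np


def colorize_points(points, pano, T_cam_from_lidar=np.eye(4)):
    """Assign each LiDAR point the color of its equirectangular pixel.

    points : Nx3 array in the LiDAR frame (no points at the origin).
    pano   : HxWx3 panoramic image (e.g., from a Ricoh Theta).
    """
    # Move points into the panoramic camera frame.
    homog = np.hstack([points, np.ones((len(points), 1))])
    p = (homog @ T_cam_from_lidar.T)[:, :3]
    h, w = pano.shape[:2]
    # Spherical angles -> equirectangular pixel coordinates.
    az = np.arctan2(p[:, 1], p[:, 0])                    # azimuth, (-pi, pi]
    el = np.arcsin(p[:, 2] / np.linalg.norm(p, axis=1))  # elevation
    u = ((az + np.pi) / (2 * np.pi) * (w - 1)).astype(int)
    v = ((np.pi / 2 - el) / np.pi * (h - 1)).astype(int)
    return pano[v, u]
```

The resulting per-point colors, together with the XYZ coordinates, are what a GPU particle system such as Niagara consumes as point sprites.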
To ensure that this virtual representation accurately reflects the physical state, the system leverages a unified coordinate mapping based on the robot’s rigid structure. Each sensor S is defined by a static transformation matrix $T_B^S$ relative to the robot’s base B, as shown in Equation (1):
$$T_B^S = \begin{bmatrix} R_{3\times3} & t_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix} \quad (1)$$
This mathematical alignment ensures that all incoming data is pre-processed and spatially consistent with the robot’s kinematic model before reaching the RobotMiddleware.
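As a concrete sketch of Equation (1) and the base-frame mapping, the homogeneous transform can be assembled and applied as follows (NumPy; function names are illustrative):

```python
import numpy as np


def sensor_to_base(R, t):
    """Build the homogeneous transform T_B^S of Equation (1)
    from a 3x3 rotation R and a 3-vector translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def to_base_frame(points_sensor, T):
    """Map Nx3 sensor-frame points into the robot base frame."""
    homog = np.hstack([points_sensor, np.ones((len(points_sensor), 1))])
    return (homog @ T.T)[:, :3]
```

Because the sensors are rigidly mounted, each $T_B^S$ is computed once from calibration and reused for every frame, which is what makes the pre-processing step cheap enough to run before the data reaches the RobotMiddleware.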
Figure 6 illustrates the resulting point-cloud visualization in Unreal Engine for both operation modes: a simulated scene in Webots and the corresponding real-world reconstruction. This comparison highlights that the same VR visualization pipeline and rendering strategy (Niagara-based GPU point sprites) is preserved across simulation and physical deployment, providing consistent 3D situational awareness to the operator.
The selection of point-cloud rendering over traditional 2D video or real-time mesh reconstruction was driven by the need to balance depth perception with system latency. While 2D streams lack the spatial cues necessary for complex dual-arm coordination, and mesh reconstruction often introduces prohibitive computational overhead for high-density data, point clouds provide the 3D immersion required to estimate distances between manipulators accurately [25]. The use of a colored 3D point cloud is further justified by its ability to facilitate precise navigation with minimal collision risks in complex industrial environments. Although it may involve a higher cognitive load, this modality provides full spatial immersion and captures critical color and geometric information essential for target identification in inspection tasks. Furthermore, this approach leverages the 6-DoF head-tracking of the VR headset, allowing operators to dynamically overcome occlusions and sensor noise through natural head movement and multi-perspective observation.

3.7. Temporal Synchronization and Latency Management

Temporal synchronization is critical for aligning the dynamic movements of dual-arm manipulators with visual feedback. All machines in the local network are synchronized via Precision Time Protocol (PTP), achieving a clock offset of less than one millisecond. This high-precision timing enables the RobotMiddleware to use a timestamp-based buffer, ensuring that the robot’s kinematic joint states are perfectly matched to the corresponding point-cloud frames from the ZED 2i camera.
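A timestamp-based buffer of the kind described can be sketched as follows. This is a simplified illustration under the assumption of PTP-aligned clocks; the paper does not detail the middleware’s internal data structures, and the class below is hypothetical:

```python
import bisect


class TimestampBuffer:
    """Keep recent (timestamp, payload) pairs sorted by arrival time
    and return the entry nearest to a query timestamp.

    With PTP keeping clock offsets below 1 ms, nearest-timestamp
    matching pairs joint states with the point-cloud frame captured
    at (almost) the same physical instant.
    """

    def __init__(self, maxlen=64):
        self.times, self.items, self.maxlen = [], [], maxlen

    def push(self, t, item):
        self.times.append(t)
        self.items.append(item)
        if len(self.times) > self.maxlen:  # drop the oldest entry
            self.times.pop(0)
            self.items.pop(0)

    def nearest(self, t):
        i = bisect.bisect_left(self.times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.times)]
        best = min(candidates, key=lambda j: abs(self.times[j] - t))
        return self.items[best]
```

When a ZED 2i frame stamped at time $t$ arrives, querying the joint-state buffer with `nearest(t)` yields the kinematic configuration to render alongside that frame.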
The ZED 2i serves as the primary sensor for generating the dense point cloud required for precision tasks. We measured an end-to-end latency of approximately 138 ms, defined as the time elapsed from the physical event captured by the ZED 2i sensor to the final image reproduction in the VR headset. Despite this perceptible delay, the system maintains a high update frequency and visual stability. During high-precision tasks, sub-millisecond synchronization of PTP-aligned data ensures that the virtual manipulators and their real-world counterparts are spatially coherent, allowing operators to compensate for visual lag through consistent 3D depth cues. Crucially, the rendering of the robot’s virtual mesh is decoupled from the point-cloud buffer. While the point-cloud representation of the arms is subject to the measured latency, the virtual mesh of the manipulators is updated nearly instantaneously based on direct joint encoder data. This ensures a zero-latency visual response for the operator’s primary movements, providing a predictive reference that facilitates coordination despite the inherent sensing delay.

3.8. Simulation-to-Real Operation

A central design goal of the proposed framework is to preserve the operator interaction paradigm across simulation and real deployment. The same UE VR interface (controller-to-end-effector mapping, visualization, and safety gating) is used with either Webots or the physical P3bot by switching the backend that receives joint velocity commands and provides state feedback. This sim-to-real consistency supports iterative development, safe testing, and user training in simulation before transferring the same workflows to the real robot.

4. Experimental Results

In this section, we present the experimental results from both technical performance evaluations and user studies. We divided our evaluation into two main areas: (1) a technical experiment involving system performance under various conditions using different point cloud sizes, ranging from 50 K to 2 M points, and (2) a comparison of our VR-based teleoperation framework with existing solutions in terms of performance and usability.

4.1. Technical Experiment

To assess the performance of our VR framework, we measured system resource usage (CPU, GPU, RAM, and VRAM), frame rendering time, and frame rate (FPS). The tests were performed using an NVIDIA RTX 3070 GPU with 8 GB of VRAM, a 1 Gbps communication network, and the Meta Quest 3 VR headset with a resolution of 2144 × 2240 per eye. The Niagara system was configured to 20 Hz, with the UE frame rate limited to 90 FPS. The tests were conducted with different point cloud sizes (50 K, 500 K, 1 M, 1.5 M, and 2 M points) to assess the scalability and performance of the system in real-time rendering and interaction scenarios.

4.1.1. CPU and GPU Usage

Figure 7a presents the CPU usage statistics for different point cloud sizes. Average CPU usage grows with the number of points, peaking for the 2 M point cloud, which the system still handles, albeit under significant load.
The GPU usage, shown in Figure 7b, also increases as the number of points grows. However, the GPU is less stressed compared to the CPU, with utilization peaking at 81% for the 2 M point cloud. The GPU handles the rendering load efficiently, with a manageable increase in utilization as the point cloud size increases.

4.1.2. Memory Usage (RAM and VRAM)

Figure 8a,b show the memory usage for RAM and VRAM, respectively. As expected, both increase with larger point clouds, with VRAM usage becoming the critical factor at higher point cloud sizes.

4.1.3. Latency and FPS

The frame rendering time (or frame period) and frame rate metrics are crucial for evaluating the responsiveness of the VR environment. As shown in Figure 9a, the rendering period increases as the point cloud size grows, resulting in higher latency. For frames per second (FPS), Figure 9b shows that the system maintains a stable 90 FPS for point clouds up to 1 M points. However, for larger point clouds, the FPS decreases.
The system starts to experience a slight drop in FPS with point clouds exceeding 1 M points, but it still maintains a smooth user experience, with FPS remaining above 70 even for the heaviest point cloud tested. This stability helps mitigate one of the common issues in VR: motion sickness.

4.1.4. System Latency

Figure 10 illustrates the end-to-end latency breakdown of the proposed system. Each processing component is represented as a block, while arrows indicate the corresponding processing and communication delays between system modules.
The ZED stereo camera exhibits a total latency in the range of 86–92 ms, of which only approximately 7 ms corresponds to onboard image processing. The remaining delay is primarily dominated by data transmission and synchronization over the network. At an operating frequency of 20 Hz, each frame carries approximately 64.8 Mbits, making network bandwidth a critical factor in overall system latency. As a result, increasing the available network throughput could significantly reduce this component of the delay.
The impact of point cloud size on communication latency is further quantified in Table 3. As shown, communication latency scales almost linearly with the number of transmitted points, ranging from 14–17 ms for 50 K points (3.6 Mbits) up to 187–193 ms for 2 M points (144 Mbits) on a 1 Gbps network. These results highlight the trade-off between visual fidelity and transmission latency, reinforcing the importance of balancing point cloud resolution with real-time performance requirements.
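The reported payloads imply roughly 72 bits (9 bytes) per transmitted point (e.g., 144 Mbits / 2 M points, or 3.6 Mbits / 50 K points), and the ideal serialization time on a 1 Gbps link follows directly. The back-of-the-envelope check below reproduces these figures; the measured 14–193 ms latencies additionally include protocol, synchronization, and processing overhead not modeled here:

```python
def payload_mbits(n_points, bytes_per_point=9):
    """Payload size in Mbits for n_points at the ~9 B/point implied
    by the paper's reported figures (an inferred value, not a spec)."""
    return n_points * bytes_per_point * 8 / 1e6


def ideal_tx_ms(mbits, link_gbps=1.0):
    """Lower-bound serialization time (ms) on an otherwise idle link."""
    return mbits / (link_gbps * 1e3) * 1e3
```

For example, `payload_mbits(2_000_000)` gives 144 Mbits, whose ideal transmission time on 1 Gbps is 144 ms, already close to the measured 187–193 ms and confirming that the network, rather than computation, dominates this component of the delay.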
The LiDAR sensor introduces a latency of 6–8 ms, while the Ricoh camera contributes less than 3 ms. These sensor streams are fused within the RGBD360 module, which adds less than 2 ms of processing latency before forwarding the aggregated data to the middleware.
Regarding actuation, the Kinova manipulators exhibit a state transmission latency below 1 ms, while the reception and execution latency at the controller level is approximately 5 ms through the middleware interface. On the visualization side, data preprocessing for point cloud rendering in the Niagara pipeline requires less than 3 ms, while the VR rendering stage operates within 9–10 ms per frame. This rendering latency is consistent with the frame times reported in the technical evaluation in Section 4.1.3.
Overall, although individual sensors, particularly the stereo camera, introduce non-negligible delays due to high data throughput requirements, the remaining components of the pipeline exhibit low and predictable latencies. The cumulative system latency remains within the bounds required for real-time VR teleoperation, enabling responsive interaction, stable visual feedback, and effective closed-loop control.

4.2. Real-World Tests: Teleoperation Evaluation

We conducted seven tests with 17 participants: 7 with prior experience in VR gaming and 10 without. The age and sex distribution of the participants is shown in Figure 11. The goal was to evaluate the usability and ease of control of our VR teleoperation system. The 17 participants performed 6 tasks (Video of the tasks available at https://www.youtube.com/playlist?list=PLUJmKmCuxO5El45JNI8rR4xMEMVZMZEt2 (accessed on 22 January 2026)), sequentially and in the same order.
Prior to the tests, participants were given a self-paced adaptation period to familiarize themselves with the system, which did not exceed 300 s.

4.2.1. Pick and Place

In this task, participants were required to grasp a pencil and move it from one cup to another, as illustrated in Figure 12.
As shown in Figure 13, participants with prior VR experience achieved a lower mean completion time and exhibited a more compact distribution compared to participants without VR experience. This suggests that familiarity with VR environments positively influences early task performance, particularly for simple grasping and placement actions.

4.2.2. Stacking Cubes

In this task, participants were asked to build a tower using three cubes, as illustrated in Figure 14.
As shown in Figure 15, participants lacking VR experience exhibited a slight reduction in both the relative dispersion and the mean during stacking actions. In contrast, participants with VR experience preserved a similar mean time between both actions but showed a notable reduction in dispersion during the second stacking step. This indicates improved consistency as users adapted to the bimanual manipulation required by the task.

4.2.3. Toy Handover

In this task, participants were required to pick up a toy with one manipulator, transfer it to the other manipulator, and place it in a designated location, as illustrated in Figure 16.
As shown in Figure 17, participants with VR experience achieved significantly lower median completion times and reduced dispersion compared to those without VR experience, particularly during the handover phase. However, no significant differences were observed between the two groups when placing the toy on the right side of the table, suggesting that the drop action posed minimal difficulty regardless of prior VR familiarity.

4.2.4. Folding Cloth

In this task, participants were required to fold a piece of cloth by moving both arms synchronously, as illustrated in Figure 18.
As shown in Figure 19, completion times were similar for both participant groups. This outcome may be attributed to the inherently bimanual and synchronized nature of the task, which requires coordinated motion of both robotic arms and limits the advantage typically provided by prior VR experience.

4.2.5. Writing “HI”

In this task, participants were required to grasp a pen and write “HI” on the table surface, as illustrated in Figure 20.
As shown in Figure 21, participants with VR experience demonstrated a more concentrated distribution when grasping the pen. However, during the writing phase, this group exhibited a slightly higher mean completion time compared to participants without VR experience. As a result, the overall task completion times for both groups were comparable. This may indicate that fine motor control during precise writing movements is influenced more by task-specific adaptation than by general VR familiarity.

4.2.6. Erasing “HI”

In this task, participants were required to erase the written “HI” from the table using a sponge, as illustrated in Figure 22.
As shown in Figure 23, participants without prior VR experience unexpectedly achieved lower completion times than those with VR experience. This result may be explained by the simplicity and repetitive nature of the erasing action, which relies less on spatial awareness and more on continuous motion, thereby reducing the impact of prior VR exposure.

4.2.7. Summary Success Rate

Table 4 summarizes the task success rates and collision metrics. Overall, participants completed the tasks with high success rates, particularly those with VR experience and especially in tasks involving simpler interactions, such as Toy Handover (94.11%). Tasks requiring more precise and coordinated actions, such as Pick and Place (76.47%) and Cube Stacking (76.47%), showed slightly lower success rates, with occasional collisions between the manipulators and the environment.
This performance disparity is primarily explained by the physical characteristics of the targets, the order of the experiments, and the visual constraints of the reconstruction. Pick and Place and Cube Stacking were the first tasks performed by the participants. In the former, the pencil was positioned vertically inside a cup; its diameter (1.2 cm) and the limited height protruding from the container (2.3 cm) demanded high precision, increasing the likelihood of environmental collisions during the approach. In the Cube Stacking task, the recorded timeouts (3 cases) occurred as participants struggled to align the cubes due to shadows projected by the robot’s arms, which distorted depth perception in the point cloud. In contrast, the Erasing task achieved a 100% success rate: this final task involved a high-contrast sponge and, much like the Toy Handover, a goal that did not require high precision. It is important to note that although all participants with prior VR experience completed the entire battery of tests, some inexperienced users faced difficulties in the most demanding scenarios. Those with VR experience also achieved shorter completion times. These results suggest that while tasks with narrow tolerances and small graspable surfaces may require more operator experience, the system remains highly intuitive for general manipulation where objects are visually distinct and accessible.

4.2.8. Post-Test User Feedback

A post-experiment subjective questionnaire was administered to assess user experience, usability, and perceived performance of the VR teleoperation system. Responses were collected using a 5-point Likert scale, where higher values indicate stronger agreement and lower values indicate stronger disagreement:
  • The overall experience was comfortable in terms of hardware fit, weight, cables, and controllers.
    Overall comfort received consistently high ratings (Figure 24), with most participants scoring between 4 and 5. This indicates that the hardware setup (headset fit, weight, controllers, and cables) was well tolerated during the experiments.
  • I felt a strong sense of presence, as if I were present in the robot’s environment or controlling the robot through its perspective.
    Similarly to the previous question, the sense of presence was rated positively (Figure 25), with the majority of participants reporting a strong feeling of being present in the robot’s environment or controlling the robot through its perspective. These results suggest that the immersive visualization and motion mapping effectively support embodiment in the teleoperation task.
  • I did not experience symptoms of motion sickness (e.g., dizziness, nausea, or headache) during or after the experiment.
    The system demonstrated very good tolerance to motion sickness. As shown in Figure 26, the vast majority of participants reported no symptoms at all, selecting the maximum score on the scale. Only three participants reported mild discomfort, while none reported moderate or severe motion sickness symptoms.
  • The mapping between my manual movements in VR and the movements of the robotic manipulator was intuitive.
    The intuitiveness of the mapping between user hand movements and robotic manipulator motions was rated highly, with most participants scoring 4 or 5 (Figure 27). This suggests that the control scheme was easy to understand and required minimal cognitive effort.
  • I was able to perform precise movements (e.g., picking up small objects or inserting them) with the accuracy I desired.
    Perceived precision of manipulation received slightly more varied responses (Figure 28). While many users felt capable of performing precise actions, some participants—particularly those without prior VR experience—reported moderate difficulty. This indicates that fine manipulation tasks may require additional adaptation time or enhanced visual feedback.
  • Coordinating the two manipulators simultaneously to perform the task was easy.
    Simultaneous coordination of the two manipulators was identified as one of the more challenging aspects of the system. Although several participants rated this capability positively (Figure 29), a noticeable portion reported moderate difficulty.
  • I did not perceive noticeable latency or delay between my actions and the robot’s response or visual updates.
    Perceived system latency was generally rated as low. Most participants did not notice significant delays between their actions and the robot’s response or visual updates (Figure 30). This subjective perception is consistent with the low frame rendering times measured in the technical evaluation. However, a small number of participants reported perceiving latency that was attributed not to communication or rendering delays but rather to the intentionally limited physical motion speed of the robotic manipulators. As a critical safety measure, the arm joints were capped at a maximum velocity of 1.5 rad/s. Since this hardware speed is lower than a user’s natural hand movement in VR, some participants interpreted the safety-driven motion limit as system latency, even though data transmission and command execution remained near real-time.
  • The virtual environment and point cloud visualization provided sufficient detail to judge distances, object sizes, and textures.
    The quality of the virtual environment and point cloud visualization received mixed-to-positive ratings (Figure 31). While a majority of participants reported that the visualization provided sufficient detail to accurately judge distances and object sizes, several users identified limitations related to object occlusion and shadowing effects caused by the robotic arms.
  • I felt confident when performing complex manipulations, trusting the visual feedback and the robot’s response.
    User confidence during complex manipulations followed a similar trend (Figure 32). Most participants reported feeling confident when performing tasks, though confidence was slightly lower in tasks requiring high precision or bimanual coordination.
  • Considering ease of use and performance, this VR teleoperation system is useful for tasks in real or hazardous environments.
    Finally, the system was rated as highly useful for teleoperation tasks in real or hazardous environments. The majority of participants selected values of 4 or 5 (Figure 33), indicating strong perceived applicability of the system beyond the experimental setting.
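The safety velocity cap discussed in the latency item above can be sketched in a few lines. This is our own hypothetical illustration, not the authors' controller code; only the 1.5 rad/s limit comes from the paper.

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of the safety cap described in the text: joint
// velocity commands are limited to +/-1.5 rad/s, so fast hand motions in VR
// saturate at the hardware limit instead of being executed at full speed,
// which users may perceive as latency.
constexpr double kMaxJointVel = 1.5;  // rad/s, safety limit from the paper

std::vector<double> clamp_joint_velocities(std::vector<double> cmd) {
    for (double& v : cmd)
        v = std::clamp(v, -kMaxJointVel, kMaxJointVel);
    return cmd;
}
```

With such a cap in place, a 3 rad/s command derived from a quick hand motion is executed at 1.5 rad/s, so the arm lags the hand even though transmission and rendering remain near real-time.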
The participants provided valuable insights into their experiences with the VR teleoperation system. Based on the feedback, we have identified several strengths of the system and areas that require further refinement.
  • Strengths
  • Low Latency and Precision: Several users highlighted the system’s low latency, which contributed to a smooth and responsive experience. The movements of the robotic arms, particularly the grippers, were praised for their precision and their ability to accurately follow the user’s hand orientation. This precision was especially notable in tasks that required fine manipulation, such as Writing and Folding.
  • Immersion and Visual Feedback: Many participants reported a strong sense of immersion and presence in the virtual environment. The system’s ability to provide realistic visual feedback, including the synchronized movement of the robot’s arms with the user’s hands, was cited as a key strength. Users felt they could easily estimate depth and object positioning.
  • Intuitiveness and Ease of Use: A recurring theme in the feedback was the intuitiveness of the system. Participants, regardless of their VR experience, found the controls to be easy to learn and use. The system’s simple and effective control scheme, which allows simultaneous manipulation of both arms, was appreciated for its straightforwardness, especially by those with no prior VR experience.
  • Areas for Improvement
  • Shadows and Perception Issues: A significant number of participants pointed out that the shadows generated by the robot’s arms were problematic. These shadows often distorted their perception of the objects and the task environment, making it difficult to accurately position and manipulate objects, particularly in tasks requiring high precision. Many participants suggested that improving shadow handling or providing clearer visual cues for depth and object positioning would greatly enhance the experience.
In summary, participants found the VR teleoperation system to be immersive, intuitive, and responsive, with low latency and well-synchronized hand-arm movements. However, several areas for improvement were identified, particularly in terms of shadow handling. Addressing these issues will enhance the user experience, especially for tasks that require precise manipulation and depth perception.

4.3. Comparison with Related Work

Compared with related works such as IMMERTWIN [19], our system achieves competitive performance using more modest hardware, as summarized in Table 5. Specifically, our framework operates on an NVIDIA RTX 3070 GPU while generating a 1 M-point cloud at 20 Hz, whereas IMMERTWIN relies on an RTX 4090 to render a 1.6 M-point cloud at 10 Hz. Furthermore, while IMMERTWIN utilizes two fixed ZED 2i cameras to optimize the field of view within a static tabletop scenario, our system maintains high-performance real-time rendering in a mobile robot configuration. This mobility, combined with our higher update frequency, provides a more versatile solution for dynamic environments, even if direct task-based comparisons are influenced by their different sensor setups.
Compared to GELLO [8], our approach demonstrates higher success rates in manipulation tasks such as handover and folding cloth when using the VR-based teleoperation mode. Furthermore, our system shows superior overall performance compared to OpenVR (SoftwareX) [11], which may be partially attributed to implementation differences. While OpenVR is developed using C# with Unity and Python components, our framework is implemented entirely in standard C++23, emphasizing vectorized container-based iterations and avoiding unnecessary memory copies, resulting in improved computational efficiency and lower latency.

5. Conclusions

This paper presented a VR-based teleoperation framework for dual-arm robotic manipulation, integrating immersive visualization through real-time point cloud rendering and intuitive motion mapping. The system was evaluated through both objective technical measurements and a comprehensive user study.
The technical evaluation demonstrated that the framework maintains high frame rates and low rendering latency under local network conditions. However, the analysis of end-to-end latency reveals a critical trade-off between visual density and operational safety. According to recent studies on humanoid teleoperation, latencies exceeding 250 ms significantly increase motion sickness risk, while high-precision tasks ideally require a response time below 150 ms [26]. Our system, operating at 2 million points, approaches these limits (∼190 ms). This marks a clear applicability boundary: while highly effective in high-speed local networks, its deployment over standard long-distance internet protocols may introduce jitter and artifacts, potentially reducing task safety and user comfort.
Beyond latency, specific failure scenarios were identified, such as self-occlusion during dual-arm coordination and shadowing effects. While our findings indicate that operators effectively develop compensatory strategies, leveraging 6-DoF head-tracking to gain multiple perspectives and "see through" occlusions, this reliance on human dexterity represents a potential cognitive bottleneck. As suggested in related teleoperation studies [27], high data density and complex control scenarios can significantly increase cognitive load. However, our task-based performance analysis showed that while prior VR experience reduces completion times during initial interactions, performance levels eventually converge as task complexity increases. This suggests that the system’s intuitiveness and the natural mapping of the VR interface allow even novice users to overcome initial hurdles and manage the cognitive demands of high-density point cloud environments effectively.
The user study results further indicate that the system offers high levels of comfort and presence. Users perceived the control as responsive, and the majority considered the system suitable for hazardous environments. Despite the identified limitations in visual clarity during fine manipulation, the results confirm the scalability of the proposed architecture.
Future developments will focus on improving visual feedback through automated viewpoint optimization and adaptive assistance mechanisms for bimanual coordination. Further studies in real-world deployment scenarios will validate the system’s robustness under degraded network conditions. Overall, the proposed system is an effective and scalable solution for immersive robotic manipulation, bridging the gap between intuitive human control and robust real-time robotic performance.

Author Contributions

Conceptualization, A.T. and S.E.; methodology, P.B.; software, A.T., S.E. and J.C.; validation, A.T. and P.N.; formal analysis, A.T.; investigation, A.T., S.E., J.C. and P.B.; resources, A.T.; data curation, A.T.; writing—original draft preparation, P.N.; writing—review and editing, A.T. and P.N.; visualization, A.T. and P.N.; supervision, A.T. and P.N.; project administration, P.N.; funding acquisition, P.B. and P.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was co-funded by the FEDER Project 0124_EUROAGE_MAS_4_E under the POCTEP Programme 2021–2027, and by the R&D&i project PID2022-137344OB-C31, supported by MICIU/AEI/10.13039/501100011033 and “FEDER, Una manera de hacer Europa”. Additional co-funding was provided by the European Union through the European Regional Development Fund (85%) and by the Junta de Extremadura. The managing authority is the Ministerio de Hacienda (Spain). Grant GR24194.

Institutional Review Board Statement

All subjects gave their informed consent for inclusion before they participated in the study. Ethics approval is not required for this type of study, as it is non-interventional, with no personally identifiable information (PII) collected and data processed anonymously. The study was conducted following the local legislation: Spanish Organic Law 3/2018 (LOPDGDD) on the Protection of Personal Data and Guarantee of Digital Rights (https://www.boe.es/eli/es/lo/2018/12/05/3/con (accessed on 22 January 2026)).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. Participants were informed of the research’s purpose and the anonymity of their responses prior to completing the questionnaire. Submission of the survey was considered as implied consent to participate.

Data Availability Statement

Framework repository available at https://github.com/alfiTH/VR_teleoperation (accessed on 22 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IK	Inverse Kinematics
PTP	Precision Time Protocol
R-CNN	Region-based Convolutional Neural Network
RRMC	Resolved-Rate Motion Control
UE	Unreal Engine 5.6.1
VR	Virtual Reality
XR	Extended Reality

References

  1. Tadokoro, S. (Ed.) Disaster Robotics, 1st ed.; Springer Tracts in Advanced Robotics; Springer: Cham, Switzerland, 2019; p. 534. [Google Scholar] [CrossRef]
  2. Yoshinada, H.; Kurashiki, K.; Kondo, D.; Nagatani, K.; Kiribayashi, S.; Fuchida, M.; Tanaka, M.; Yamashita, A.; Asama, H.; Shibata, T.; et al. Dual-Arm Construction Robot with Remote-Control Function. In Disaster Robotics: Results from the ImPACT Tough Robotics Challenge; Tadokoro, S., Ed.; Springer International Publishing: Cham, Switzerland, 2019; pp. 195–264. [Google Scholar] [CrossRef]
  3. Nagatani, K.; Kiribayashi, S.; Okada, Y.; Otake, K.; Yoshida, K.; Tadokoro, S.; Nishimura, T.; Yoshida, T.; Koyanagi, E.; Fukushima, M.; et al. Emergency response to the nuclear accident at the Fukushima Daiichi Nuclear Power Plants using mobile rescue robots. J. Field Robot. 2013, 30, 44–63. [Google Scholar] [CrossRef]
  4. Phillips, B.T.; Becker, K.P.; Kurumaya, S.; Galloway, K.C.; Whittredge, G.; Vogt, D.M.; Teeple, C.B.; Rosen, M.H.; Pieribone, V.A.; Gruber, D.F.; et al. A Dexterous, Glove-Based Teleoperable Low-Power Soft Robotic Arm for Delicate Deep-Sea Biological Exploration. Sci. Rep. 2018, 8, 14779. [Google Scholar] [CrossRef] [PubMed]
  5. Jakuba, M.V.; German, C.R.; Bowen, A.D.; Whitcomb, L.L.; Hand, K.; Branch, A.; Chien, S.; McFarland, C. Teleoperation and robotics under ice: Implications for planetary exploration. In Proceedings of the 2018 IEEE Aerospace Conference, Big Sky, MT, USA, 3–10 March 2018; pp. 1–14. [Google Scholar] [CrossRef]
  6. Das, R.; Baishya, N.J.; Bhattacharya, B. A review on tele-manipulators for remote diagnostic procedures and surgery. CSI Trans. ICT 2023, 11, 31–37. [Google Scholar] [CrossRef]
  7. Sam, Y.T.; Hedlund-Botti, E.; Natarajan, M.; Heard, J.; Gombolay, M. The Impact of Stress and Workload on Human Performance in Robot Teleoperation Tasks. IEEE Trans. Robot. 2024, 40, 4725–4744. [Google Scholar] [CrossRef]
  8. Wu, P.; Shentu, Y.; Yi, Z.; Lin, X.; Abbeel, P. GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar] [CrossRef]
  9. Stanton, C.; Bogdanovych, A.; Ratanasena, E. Teleoperation of a humanoid robot using full-body motion capture, example movements, and machine learning. In Proceedings of the 2012 Australasian Conference on Robotics and Automation, Wellington, New Zealand, 3–5 December 2012. [Google Scholar]
  10. Kanazawa, K.; Sato, N.; Morita, Y. Considerations on interaction with manipulator in virtual reality teleoperation interface for rescue robots. In Proceedings of the 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Busan, Republic of Korea, 28–31 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 386–391. [Google Scholar]
  11. George, A.; Bartsch, A.; Barati Farimani, A. OpenVR: Teleoperation for manipulation. SoftwareX 2025, 29, 102054. [Google Scholar] [CrossRef]
  12. Bao, M.; Tao, Z.; Wang, X.; Liu, J.; Sun, Q. Comparative Performance Analysis of Rendering Optimization Methods in Unity Tuanjie Engine, Unity Global and Unreal Engine. In Proceedings of the 2024 IEEE Smart World Congress (SWC), Nadi, Fiji, 2–7 December 2024; pp. 1627–1632. [Google Scholar] [CrossRef]
  13. Soni, L.; Kaur, A. Merits and Demerits of Unreal and Unity: A Comprehensive Comparison. In Proceedings of the 2024 International Conference on Computational Intelligence for Green and Sustainable Technologies (ICCIGST), Vijayawada, India, 18–19 July 2024; pp. 1–5. [Google Scholar] [CrossRef]
  14. Wonsick, M.; Padir, T. A Systematic Review of Virtual Reality Interfaces for Controlling and Interacting with Robots. Appl. Sci. 2020, 10, 9051. [Google Scholar] [CrossRef]
  15. Hetrick, R.; Amerson, N.; Kim, B.; Rosen, E.; de Visser, E.J.; Phillips, E. Comparing Virtual Reality Interfaces for the Teleoperation of Robots. In Proceedings of the 2020 Systems and Information Engineering Design Symposium (SIEDS), Charlottesville, VA, USA, 24 April 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar] [CrossRef]
  16. De Pace, F.; Manuri, F.; Sanna, A. Leveraging Enhanced Virtual Reality Methods and Environments for Efficient, Intuitive, and Immersive Teleoperation of Robots. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar] [CrossRef]
  17. Rosen, E.; Jha, D.K. A Virtual Reality Teleoperation Interface for Industrial Robot Manipulators. arXiv 2023, arXiv:2305.10960. [Google Scholar] [CrossRef]
  18. Ponomareva, P.; Trinitatova, D.; Fedoseev, A.; Kalinov, I.; Tsetserukou, D. GraspLook: A VR-based Telemanipulation System with R-CNN-driven Augmentation of Virtual Environment. In Proceedings of the 2021 20th International Conference on Advanced Robotics (ICAR), Ljubljana, Slovenia, 6–10 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 166–171. [Google Scholar] [CrossRef]
  19. Audonnet, F.P.; Ramirez-Alpizar, I.G.; Aragon-Camarasa, G. IMMERTWIN: A Mixed Reality Framework for Enhanced Robotic Arm Teleoperation. arXiv 2024, arXiv:2409.08964. [Google Scholar] [CrossRef]
  20. García, A.; Solanes, J.E.; Muñoz, A.; Gracia, L.; Tornero, J. Augmented Reality-Based Interface for Bimanual Robot Teleoperation. Appl. Sci. 2022, 12, 4379. [Google Scholar] [CrossRef]
  21. Gallipoli, M.; Buonocore, S.; Selvaggio, M.; Fontanelli, G.A.; Grazioso, S.; Di Gironimo, G. A virtual reality-based dual-mode robot teleoperation architecture. Robotica 2024, 42, 1935–1958. [Google Scholar] [CrossRef]
  22. Stedman, H.; Kocer, B.B.; Kovac, M.; Pawar, V.M. VRTAB-Map: A Configurable Immersive Teleoperation Framework with Online 3D Reconstruction. In Proceedings of the 2022 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Singapore, 17–21 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 104–110. [Google Scholar] [CrossRef]
  23. Tung, Y.S.; Luebbers, M.B.; Roncone, A.; Hayes, B. Stereoscopic Virtual Reality Teleoperation for Human Robot Collaborative Dataset Collection. In Proceedings of the HRI 2024 Workshop on Virtual, Augmented, and Mixed Reality for Human-Robot Interaction (VAM-HRI), Boulder, CO, USA, 11 March 2024. [Google Scholar]
  24. Corke, P.; Haviland, J. Not your grandmother’s toolbox–the Robotics Toolbox reinvented for Python. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 11357–11363. [Google Scholar]
  25. Mazeas, D.; Namoano, B. Study of Visualization Modalities on Industrial Robot Teleoperation for Inspection in a Virtual Co-Existence Space. Virtual Worlds 2025, 4, 17. [Google Scholar] [CrossRef]
  26. Darvish, K.; Penco, L.; Ramos, J.; Cisneros, R.; Pratt, J.; Yoshida, E.; Ivaldi, S.; Pucci, D. Teleoperation of humanoid robots: A survey. IEEE Trans. Robot. 2023, 39, 1706–1727. [Google Scholar] [CrossRef]
  27. Turco, E.; Castellani, C.; Bo, V.; Pacchierotti, C.; Prattichizzo, D.; Baldi, T.L. Reducing Cognitive Load in Teleoperating Swarms of Robots through a Data-Driven Shared Control Approach. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 4731–4738. [Google Scholar] [CrossRef]
Figure 1. Teaser of the proposed VR teleoperation framework for dual-arm manipulation. The operator uses an immersive VR headset and tracked controllers (left) to command a physical dual-arm robot (right). The VR interface includes a synchronized digital twin augmented with a dense colored point cloud (bottom-left), providing real-time spatial context to support depth perception and safe bimanual interaction during tasks such as cloth manipulation.
Figure 2. System overview of the proposed VR teleoperation framework.
Figure 3. P3bot platform and its virtual counterparts used in the proposed sim-to-real pipeline: (a) real robot, (b) Unreal Engine digital model, and (c) Webots simulation model.
Figure 4. Communication and component overview of the proposed system. The UE interacts with the robotics back-end only through RobotMiddleware (static library), which bridges the XR application with RoboComp/Ice modules. Driver components (red) interface the physical devices (green): two Kinova Gen3 arms (via KinovaController), RS Helios LiDAR (via Lidar3D), Ricoh Theta camera (via RicoComponent), and ZED 2i stereo camera (via ZEDComponent). The Meta Quest 3 headset is connected through ALVR and exposed to UE via SteamVR.
Figure 5. State machine logic for the safety dead-man switch and fault-recovery protocol.
Figure 6. Point-cloud rendering in Unreal Engine using the proposed GPU-based Niagara pipeline. (a) Webots’ simulation mode, where the reconstructed scene is visualized together with the robot’s digital model. (b) Real-robot operation, showing the colored 3D reconstruction of the workspace integrated into the same immersive interface, providing depth cues and spatial context for manipulation.
Figure 7. Processing-unit usage during Unreal point rendering: (a) CPU thread usage; (b) GPU usage.
Figure 8. Memory usage during Unreal point rendering: (a) RAM; (b) VRAM.
Figure 9. Timing measurements of Unreal point rendering: (a) frame period; (b) frames per second.
Figure 10. End-to-end latency breakdown of the proposed VR teleoperation system.
Figure 11. Age and gender distribution of the 17 participants.
Figure 12. Sequence of the first task: Pick and Place.
Figure 13. Box plot for the Pick and Place task comparing participants without and with prior VR experience. The orange line indicates the median completion time.
Figure 14. Sequence of the second task: Cube Stacking.
Figure 15. Box plot showing times for stacking the first and second cubes. Stacking Cube 1: grasping the right cube and stacking it on the center cube. Stacking Cube 2: grasping the left cube and stacking it on top of Cube 1. The orange line indicates the median value, and circles denote outliers.
Figure 16. Sequence of the third task: Toy Handover.
Figure 17. Box plot showing times for toy handover and placement. Handover: grasping the toy and transferring it to the other hand. Drop: releasing the toy on the right side of the table. The orange line indicates the median value, and circles denote outliers.
Figure 18. Sequence of the fourth task: Cloth Folding.
Figure 19. Box plot showing times for picking up and folding the cloth. Take the cloth: both arms grasp the cloth in preparation for folding. Fold the cloth: both arms synchronously fold the cloth. The orange line indicates the median value, and circles denote outliers.
Figure 20. Sequence of the fifth task: Writing “HI”.
Figure 21. Box plot showing times for taking the pen and writing on the table. Take the pen: the dominant arm grasps and lifts the pen from the table. Writing: the dominant arm writes “HI” on the table surface. The orange line indicates the median value, and circles denote outliers.
Figure 22. Sequence of the sixth task: erasing the written “HI”.
Figure 23. Box plot showing times for erasing the text on the table. The orange line indicates the median value.
Figure 24. Overall comfort survey results.
Figure 25. Sense of presence survey results.
Figure 26. Motion sickness survey results.
Figure 27. Movement mapping intuitiveness survey results.
Figure 28. Precision of manipulation survey results.
Figure 29. Dual manipulator coordination survey results.
Figure 30. Latency survey results.
Figure 31. Visual environment quality survey results.
Figure 32. Confidence in complex tasks survey results.
Figure 33. Usefulness for real/hazardous tasks survey results.
Table 1. XR and teleoperation-related works (mostly 2020+): modality and setup. “N/S” denotes not specified.
| Work | Year | XR | Real | Sim | 3D Scene |
|---|---|---|---|---|---|
| Wonsick & Padir (Appl. Sci.) [14] | 2020 | VR | N/A | N/A | N/A |
| Hetrick et al. (SIEDS) [15] | 2020 | VR | Yes | No | Point cloud + cameras |
| De Pace et al. (ICRA) [16] | 2021 | VR | Yes | N/S | RGB-D capture |
| Ponomareva et al. (ICAR) [18] | 2021 | VR | N/S | N/S | Digital twins (R-CNN) |
| Stedman et al. (ISMAR Adjunct) [22] | 2022 | VR | Yes | N/S | Online 3D recon |
| García et al. (Appl. Sci.) [20] | 2022 | AR | Yes | No | Holographic overlays |
| Rosen & Jha (arXiv) [17] | 2023 | VR | Yes | No | N/S |
| Tung et al. (VAM-HRI) [23] | 2024 | VR | Yes | N/S | Stereo 2D video |
| Gallipoli et al. (Robotica) [21] | 2024 | VR | Yes | N/S | Digital twin (N/S) |
| Wu et al. (IROS) [8] | 2024 | N/S | Yes | No | No |
| George et al. (SoftwareX) [11] | 2025 | VR | Yes | No | N/S |
| Ours | 2025 | VR | Yes | Yes | Dense colored point cloud |
Table 2. XR and teleoperation-related works (mostly 2020+): key aspects and their specific relation to our proposed unified framework.
| Work | Key Aspects | Relation to Our Proposal |
|---|---|---|
| Wonsick & Padir [14] | Systematic review of design dimensions for VR robot operation. | Motivates the use of XR interfaces but lacks a unified sim-to-real implementation. |
| Hetrick et al. [15] | Comparison of VR control paradigms with live point-cloud reconstruction. | Focuses on specific tasks (Baxter) but not on engine-agnostic sim/real unification. |
| De Pace et al. [16] | RGB-D environment capture with VR-controller teleoperation. | Strong perceptual focus, but does not address dual-arm unified sim/real workflows. |
| Ponomareva et al. [18] | Augmented virtual environment with object detection and digital twin. | Specific to workload reduction, lacking a general dual-arm sim/real architecture. |
| Stedman et al. [22] | Configurable baseline using RTAB-Map for online 3D reconstruction. | Highlights data density challenges, primarily for mobile robot platforms. |
| García et al. [20] | AR (HoloLens) and gamepad for bimanual industrial teleoperation. | Addresses ergonomics but lacks immersive VR with high-density point-cloud integration. |
| Rosen & Jha [17] | Command filtering for industrial arms with black-box controllers. | Complementary approach, whereas we focus on the unified architecture and perception. |
| Tung et al. [23] | Stereoscopic egocentric feedback for dataset collection. | Lacks a dual-arm digital-twin framework with dense 3D real-time reconstruction. |
| Gallipoli et al. [21] | Dual-mode architecture for safety and workflow flexibility. | Related in safety intent but does not feature GPU-accelerated point-cloud rendering. |
| Wu et al. [8] | Low-cost kinematic replica for demonstration collection. | Efficient for contact-rich tasks but lacks immersive digital-twin perception. |
| George et al. [11] | Open-source VR teleoperation for Franka Panda via Oculus. | Aimed at demonstrations; does not provide a unified interface for simulation and hardware. |
| Ours | Unified UE5 interface for Webots and physical dual-arm control. | Provides a modular middleware abstraction with scalable GPU-based 3D reconstruction. |
Table 3. Communication latency as a function of point cloud size over a 1 Gbps network.
| Number of Points | Communication Latency | Data Size |
|---|---|---|
| 50 K | 14–17 ms | 3.6 Mbits |
| 500 K | 76–80 ms | 36 Mbits |
| 1 M | 91–98 ms | 72 Mbits |
| 1.5 M | 140–151 ms | 108 Mbits |
| 2 M | 184–196 ms | 144 Mbits |
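The payloads in Table 3 scale linearly with the number of points, implying a constant per-point footprint, and each measured latency exceeds the raw serialization time of the payload on a 1 Gbps link. The sketch below checks both properties; the inferred 72 bits (9 bytes) per point is our reading of the table, not a figure stated in the paper.

```python
# Sanity check of Table 3: every payload works out to the same
# per-point footprint, and the raw wire time on a 1 Gbps link is a
# lower bound on the measured latency (the gap is protocol and
# processing overhead). The 72 bits/point value is inferred from the
# table itself, not stated in the paper.

LINK_BPS = 1e9  # assumed 1 Gbps network, as in Table 3

rows = [  # (number of points, payload in Mbits)
    (50_000, 3.6),
    (500_000, 36.0),
    (1_000_000, 72.0),
    (1_500_000, 108.0),
    (2_000_000, 144.0),
]

for points, mbits in rows:
    bits_per_point = mbits * 1e6 / points
    wire_time_ms = mbits * 1e6 / LINK_BPS * 1e3
    print(f"{points:>9,} pts: {bits_per_point:.0f} bits/point, "
          f"wire time >= {wire_time_ms:.1f} ms")
```

For the largest cloud (2 M points), serialization alone accounts for 144 ms of the reported 184–196 ms, so the network link, not the software stack, dominates latency at high point counts.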
Table 4. Task success rates and collision metrics.
| Task | Success Rate (%) | Self-Collision Count | Environment Collision Count | Timeout |
|---|---|---|---|---|
| Pick and Place (Pencil) | 76.47 | 1 | 2 | 1 |
| Cube Stacking | 76.47 | 1 | 0 | 3 |
| Toy Handover | 94.11 | 0 | 0 | 1 |
| Cloth Folding | 88.24 | 1 | 1 | 0 |
| Writing | 82.35 | 1 | 0 | 2 |
| Erasing | 100.0 | 0 | 0 | 0 |
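The success rates in Table 4 are consistent with integer success counts out of 17 trials, one per participant (Figure 11); the counts below are our reconstruction under that assumption, not values stated in the paper.

```python
# Reading Table 4's success rates as successes out of N = 17 trials
# (one trial per participant; an assumption based on the 17-person
# study, not stated explicitly in the table).
N = 17
successes = {          # reconstructed counts (assumed)
    "Pick and Place (Pencil)": 13,   # 13/17 -> 76.47%
    "Cube Stacking": 13,
    "Toy Handover": 16,
    "Cloth Folding": 15,             # 15/17 -> 88.24%
    "Writing": 14,                   # 14/17 -> 82.35%
    "Erasing": 17,                   # 17/17 -> 100%
}
for task, k in successes.items():
    print(f"{task}: {k}/{N} = {100 * k / N:.2f}%")
```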
Table 5. Performance comparison: IMMERTWIN vs. proposed framework.
| Metric | IMMERTWIN [19] | Proposed Framework |
|---|---|---|
| GPU Hardware | NVIDIA RTX 4090 | NVIDIA RTX 3070 |
| Sensors | 2× Stereolabs ZED 2i (fixed) | Robosense Helios LiDAR + Stereolabs ZED 2i (mobile) |
| Point Cloud | 1.6 M points | 1.0 M points |
| Update Freq. | 10 Hz | 20 Hz |
| Deployment | Static setup | Mobile robot |
| FPS | 40 | 90 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Torrejón, A.; Eslava, S.; Calderón, J.; Núñez, P.; Bustos, P. Teleoperation of Dual-Arm Manipulators via VR Interfaces: A Framework Integrating Simulation and Real-World Control. Electronics 2026, 15, 572. https://doi.org/10.3390/electronics15030572
