1. Introduction
Visual servoing [1,2] is a control framework that integrates camera-based feedback to iteratively refine a robot’s trajectory in real time. This technique uses data from cameras and visual sensors to provide feedback on the robot’s environment and object interactions, enabling precise and adaptive control [3,4]. Numerous studies and advances in visual servoing have been reported in domains such as end-effector pose control of a manipulator [5,6], grasping [7,8,9,10,11], robot navigation [12,13], and medical applications [14,15,16]. The visual sensor may be mounted directly on the robot, referred to as the “eye-in-hand” setup, or placed separately in the working environment, known as the “eye-to-hand” setup. This paper focuses on the first setup.
Visual servoing comprises two main architectures: Image-Based Visual Servoing (IBVS) and Position-Based Visual Servoing (PBVS). The first, IBVS, operates directly in the image space, using visual features extracted from camera images to control robot motion [17]. The primary advantage of IBVS is its robustness to camera calibration errors and environmental changes, since it does not rely on precise 3D models. Instead, IBVS adjusts the robot’s motion based on real-time feedback from image features, making it suitable for dynamic and unpredictable environments. Its main challenges, however, are sensitivity to local minima and the difficulty of handling large displacements, which can lead to instability and inaccuracies. In contrast, PBVS uses the geometric relationship between the camera and the target object to compute the robot’s pose in 3D space [18]. This method relies on accurate camera calibration and a precise model of the environment to compute control commands. PBVS offers a global view of the task, enabling more predictable and stable control than IBVS. However, its dependence on accurate models and calibration makes it less robust to environmental changes and errors in model estimation.
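For reference, classical IBVS regulates an image-feature error with a simple proportional law; the formulation below follows the notation commonly used in the visual servoing literature (e.g., [17]) and is restated here only as background for the learned velocity prediction discussed later.

```latex
% Classical IBVS control law (standard formulation):
% s   : current visual features,  s^* : desired features,
% L_e : interaction (image Jacobian) matrix,  \lambda > 0 : control gain.
\mathbf{e} = \mathbf{s} - \mathbf{s}^{*}, \qquad
\mathbf{v}_c = -\lambda \, \widehat{\mathbf{L}}_{e}^{+} \, \mathbf{e}
```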
Recent advancements in deep learning have demonstrated significant potential for improving visual feedback control [8,9,10,11,19]. Unlike conventional approaches that depend on hand-designed features, deep learning techniques allow systems to autonomously learn and derive features from raw visual data. Convolutional neural networks (CNNs) have proven particularly effective when working with images because they can learn the features pertinent to a given problem without predefined feature extraction techniques. For example, ref. [19] presents a training procedure for a CNN that approximates the unknown inverse dynamics of the robot’s inner loop, allowing feedforward compensation to correct tracking error. The network is trained on data gathered from iterative learning control and then refined by transfer learning with real robot data to handle model discrepancies and improve performance. However, existing deep learning-based visual servoing methods often rely on separate processing of synthetic and real data, or they fail to incorporate complementary visual cues (e.g., feature point maps) at the earliest feature extraction layers. As a result, many models struggle to produce robust velocity estimates, particularly when real-world data are limited or scenes contain substantial texture variation.
A recent study [20] demonstrates a 3D vision-based pipeline that integrates YOLOv4 and ArUco markers to enable a collaborative robot arm to classify and manipulate various objects in a dynamic workspace. However, while this framework addresses object identification and grasping, it does not explore how low-level velocity control can be learned or optimized in real time, particularly via a deep learning approach that fuses additional geometric cues at the earliest stages of convolution. Our work addresses this gap by introducing a novel early fusion technique that concatenates feature maps with raw RGB data at the input level, allowing the network to exploit geometric cues from the outset.
To facilitate training, a variety of neural network models have been built on CNNs pretrained on classification tasks, including architectures such as AlexNet [21], VGG-16 [22], and FlowNet [23]. According to [8], Saxena et al. employed FlowNet for visual servoing tasks in diverse settings, avoiding the need for prior information on camera parameters or scene geometry. Their approach predicts the camera pose from concatenated images representing the current and the desired scenes. Bateux et al. [9] presented neural architectures derived from AlexNet and VGG-16, designed to predict the camera transformation between two images. These architectures target precise, reliable, real-time six-degrees-of-freedom (DOF) positioning with visual feedback, and they were trained on a synthetic dataset to enhance learning efficacy and improve robustness against changes in lighting conditions. Ref. [10] introduced DEFINet, a Siamese neural architecture that extracts features with two CNNs sharing parameters; these features are then fed to a regression module that estimates the relative pose between the current and target images acquired by an eye-to-hand camera. Ribeiro et al. [11] compared three CNN-based architectures for grasp detection, where the neural network provides a 3D pose from two input images depicting the initial and final scene layouts. The first architecture uses a single branch, concatenating the images along the depth channel to form the input array, with one regression block producing all six outputs. The second model uses the same input format but features dual output branches for position and orientation. The third CNN employs a separate feature extractor per input image, concatenates the resulting features, and uses a single regression block. Experimental results showed that the first, single-branch model achieved the best performance. A different approach is proposed in [24], where the authors introduce a Mean Similarity Image Measurement Loss function that combines an image similarity term, characterized by brightness, contrast, and structural differences, with the L1 loss. Their method uses a CNN based on the ResNet-152 architecture to predict 6-DOF pose information from monocular images, with training data generated by a spherical projection data generator that ensures uniform data distribution and efficient collection. Experimental results demonstrate that the method achieves higher pose prediction accuracy and robustness against occlusion than traditional methods. Other recent works highlight advanced strategies for 6-DOF robotic control and deep network-based detection tasks: ref. [25] presents a revised virtual decomposition plus PD control scheme for robust 6-DOF task-space manipulation, while ref. [26] illustrates how deep convolutional neural networks can effectively tackle complex, multiclass image recognition challenges.
This work aims to design and evaluate a hybrid deep learning framework for visual servoing tasks. The proposed approach incorporates additional valuable information into the input layers and utilizes transfer learning with a pretrained CNN for image classification. The chosen architecture employs an early fusion approach, enabling the integration of supplementary data derived from traditional visual servoing methods. In particular, the supplementary data that extend the network’s input arrays are a set of point features. Moreover, we compare a baseline architecture (without early fusion) with our early fusion design to demonstrate that incorporating these feature maps enhances scene generalization and overall robustness in visual servoing.
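To illustrate what such an early fusion input could look like in practice, the sketch below rasterizes detected keypoints into a single-channel feature-point map and concatenates it with the RGB data along the channel axis. The helper names, the use of OpenCV’s SURF detector (available in opencv-contrib), and the exact channel layout are illustrative assumptions, not the precise implementation used in this work.

```python
import cv2
import numpy as np

def feature_point_map(gray, hessian_threshold=400):
    """Rasterize detected keypoints into a single-channel map (hypothetical helper)."""
    # SURF lives in opencv-contrib; another detector (e.g., ORB) could be swapped in.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    keypoints = surf.detect(gray, None)
    fmap = np.zeros_like(gray, dtype=np.float32)
    for kp in keypoints:
        x = min(int(round(kp.pt[0])), fmap.shape[1] - 1)
        y = min(int(round(kp.pt[1])), fmap.shape[0] - 1)
        fmap[y, x] = 1.0                       # mark keypoint locations
    return cv2.GaussianBlur(fmap, (5, 5), 0)   # soften into a spatial density map

def early_fusion_input(rgb_current, rgb_desired):
    """Stack current/desired RGB images and their feature-point maps channel-wise."""
    maps = [feature_point_map(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
            for img in (rgb_current, rgb_desired)]
    channels = [rgb_current.astype(np.float32) / 255.0,
                rgb_desired.astype(np.float32) / 255.0,
                np.stack(maps, axis=-1)]
    return np.concatenate(channels, axis=-1)   # H x W x 8 early-fused input tensor
```

When reusing an ImageNet-pretrained backbone, one common option is to widen only the first convolutional layer so that it accepts the extra channels, initializing the additional kernel weights to zero or small random values; this keeps the transfer learning strategy described above intact.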
While our current framework uses only visual data for velocity prediction, integrating multisensor data (e.g., depth, IMU, or force/torque) could significantly enhance real-time performance and robustness. For example, ref. [27] reports that combining visual cues with force feedback leads to sub-millimeter precision in contact-rich assembly tasks, effectively handling scenarios where pure vision might fail due to occlusions or rapid visual changes. Similarly, ref. [28] demonstrates how fusing onboard cameras with inertial measurements enables a quadrotor to outperform expert pilots in drone racing, underscoring the value of richer sensor inputs in stabilizing high-speed maneuvers. Nonetheless, incorporating multiple sensors raises challenges, including precise calibration, sensor-specific noise modeling, and maintaining low-latency data processing on resource-constrained platforms.
This research introduces a visual servoing (VS) system that processes visual feedback and was fully trained and tested on a real dataset. The system operates without requiring any 3D model or camera parameter information. Given a desired image, the network predicts the most appropriate velocity commands for the camera. The main contributions of this work are as follows:
Design of a VS framework using an early fusion CNN-based method that employs feature points to access low-level pixel information, so that all network inputs carry a comprehensive description of the initial and final scenes.
Construction of a large dataset for end-to-end VS control using a Universal Robots manipulator (UR5) and deployment of the network on a real robot, as an extension of [29].
Implementation of the Deep Learning-based Visual Servoing control law in a real system using a UR5 robot.
3. Dataset Configuration
For the real system to perform the visual servoing task effectively, it is essential to construct a dataset that accurately reflects the environment in which the robot will operate. This dataset must address the task’s requirements and exhibit enough variability to ensure strong generalization. The data collection scenario employs a UR5 collaborative robot fitted with an “eye-in-hand” setup based on a ZED Mini stereo camera. The camera acquires images at 1080p resolution with a frame rate of 25 FPS. The setup also includes an NVIDIA Xavier AGX with a 512-core Volta GPU and an 8-core ARM 64-bit CPU, which ensures that all components of our visual servoing pipeline, including SURF feature point computation and the CNN-based prediction, run in real time at a stable 25 FPS.
Figure 3 presents the real-time application setup used for this work.
Figure 4 shows how the robotic system’s physical and computational layers are unified through ROS (Robot Operating System). The physical layer comprises the ZED Mini camera and the UR5 robot shown in Figure 3, linked by a motion dependency: any robot movement directly affects the captured visual data. The camera acquires images and sends them over USB to the camera node in the computational layer, while the robot concurrently transmits its pose data to the robot node via Ethernet.
The application node is the central unit within the ROS-based computational layer, using GPU acceleration to handle image and pose data efficiently. From this information, it calculates velocity commands and relays them to the robot node, thereby directing the UR5 robot’s real-time movements. This integrated design ensures precise data acquisition and processing, forming the backbone for robust visual servoing and broader generalization.
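To make this data flow concrete, the following minimal rospy sketch shows how an application node of this kind could be wired up. The topic names, message types, and the predict_velocity callback are illustrative assumptions rather than the exact interfaces of our system.

```python
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge

class ApplicationNode:
    """Minimal sketch of the ROS application node (topic names are assumptions)."""
    def __init__(self, predict_velocity):
        self.bridge = CvBridge()
        self.predict_velocity = predict_velocity  # e.g., the CNN wrapped in a callable
        self.cmd_pub = rospy.Publisher("/robot/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/camera/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        v = self.predict_velocity(frame)  # 6-vector: (vx, vy, vz, wx, wy, wz)
        cmd = Twist()
        cmd.linear.x, cmd.linear.y, cmd.linear.z = v[:3]
        cmd.angular.x, cmd.angular.y, cmd.angular.z = v[3:]
        self.cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("application_node")
    node = ApplicationNode(predict_velocity=lambda img: [0.0] * 6)  # placeholder model
    rospy.spin()
```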
To create the dataset, a strategy inspired by [9] was adopted, in which an entire dataset is synthetically generated from a single reference RGB image and its corresponding pose. Figure 5 presents the reference pose used in this work, acquired with the camera at the pose ( m, m, m, rad, rad, rad).
One effective strategy for generating additional camera views around a known reference pose is introducing slight perturbations in translation and rotation. Specifically, the translation components $(\Delta t_x, \Delta t_y, \Delta t_z)$ and rotation components $(\Delta \theta_x, \Delta \theta_y, \Delta \theta_z)$ are drawn from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu = 0$ centers the distribution at the reference pose and $\sigma$ determines the spread of the sampled poses. This approach naturally clusters samples around the reference due to the Gaussian peak at zero offset and maintains a smooth distribution over the six dimensions of rigid motion.
To ensure realistic camera movement, the Gaussian parameters must be carefully set. In particular, translational offsets are sampled with a small standard deviation (in meters), while rotational offsets use separate standard deviations for the individual rotation axes. These values keep the new poses within small but meaningful deviations from the reference. Moreover, incorporating explicit bounds on translation and rotation guards against implausibly large shifts while promoting diversity in the synthesized dataset.
Although these offsets arise from a Gaussian, some poses may still be unrealistic for a particular application. One can impose bounding rules on each newly sampled pose to address this. After a perturbation is drawn, the camera’s 3D position and rotation angles are examined to ensure they remain within fixed limits on rotation (in radians) and translation (in meters). This validation step ensures that synthetic viewpoints stay within a feasible range, filtering out outliers with excessive displacement or rotation.
Once a sampled pose is obtained, it is checked against the admissible pose space. In this work, the pose space is defined by bounded shifts in the x and y directions, with a smaller range permitted in z; likewise, rotational changes are limited relative to the original orientation. Any pose outside these bounds is discarded and resampled. This mechanism maintains a balanced spread of valid viewpoints reachable by the UR5 robot, without straying into physically unrealistic regions.
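A minimal sketch of this perturb-and-reject sampling is given below. The specific standard deviations and bounds are placeholders (the exact values depend on the workspace), and the helper name sample_pose_offset is ours.

```python
import numpy as np

# Placeholder standard deviations and bounds (workspace-dependent example values).
SIGMA = np.array([0.05, 0.05, 0.02, 0.03, 0.03, 0.05])   # (tx, ty, tz, rx, ry, rz)
BOUNDS = np.array([0.15, 0.15, 0.05, 0.10, 0.10, 0.15])  # max |offset| per component

def sample_pose_offset(rng=None, max_tries=100):
    """Draw a zero-mean Gaussian pose perturbation and reject out-of-bounds samples."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        offset = rng.normal(loc=0.0, scale=SIGMA)  # 6-D offset around the reference pose
        if np.all(np.abs(offset) <= BOUNDS):
            return offset  # translation in metres, rotation in radians
    raise RuntimeError("No valid pose offset found within the bounds")
```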
When the camera undergoes a small change in pose while observing a planar region of the scene, the relationship between the original and new images can be described by a homography [35]. In computer vision, a homography is a projective mapping that transforms points from one image of a planar surface into their counterparts in another image, both captured under a pinhole camera model. This concept underpins key operations, such as image registration and rectification, that align images from differing viewpoints [36,37]. Formally, if $\mathbf{p}_1$ and $\mathbf{p}_2$ represent the homogeneous coordinates of corresponding points in the first and second images, respectively, they are related by
$$\mathbf{p}_2 \approx H\,\mathbf{p}_1,$$
where $\approx$ indicates equality up to a scale factor, consistent with the nature of homogeneous coordinates.
Consider a plane specified by $\mathbf{n}^{\top}\mathbf{X} = d$, where $\mathbf{n}$ is the plane normal and $d$ is its distance from the camera. The homography induced by a planar surface in 3D can be written as
$$H_{3\mathrm{D}} = R + \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d},$$
where $R$ and $\mathbf{t}$ denote the rotation and translation between the two viewpoints. Including the camera intrinsic matrix $K$ yields the final 2D projective transformation,
$$H = K\left(R + \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d}\right)K^{-1},$$
which fully describes how to warp the original image to approximate the new viewpoint.
The matrix $K$ encodes the camera’s internal parameters: the horizontal and vertical focal lengths $f_x$ and $f_y$ and the principal point $(c_x, c_y)$. Incorporating $K$ ensures consistent projection into the image plane for both the reference and transformed poses. Since variations in focal length or principal point placement can significantly alter how transformations appear in pixel coordinates, applying $K$ and $K^{-1}$ on either side of the homography is crucial for accurate image warping.
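The homography above maps directly to a few lines of NumPy/OpenCV; the sketch below is illustrative, with the intrinsics, plane normal, and distance given only as assumed example values.

```python
import cv2
import numpy as np

def planar_homography(K, R, t, n, d):
    """H = K (R + t n^T / d) K^{-1} for a plane n^T X = d seen by a pinhole camera."""
    H = K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)
    return H / H[2, 2]  # normalize the arbitrary scale factor

# Example values (assumed for illustration only).
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # small-rotation example
t = np.array([0.02, -0.01, 0.0])   # metres
n = np.array([0.0, 0.0, 1.0])      # plane normal facing the camera
d = 0.5                            # plane distance in metres

reference = cv2.imread("reference.png")
H = planar_homography(K, R, t, n, d)
warped = cv2.warpPerspective(reference, H, (reference.shape[1], reference.shape[0]))
```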
During the homography transformation, some regions in the warped output may lack corresponding pixels in the source image. To fill these missing points, a background color is estimated using a k-means clustering approach. The reference image is reshaped into a set of pixel-value vectors, and k-means is run (with a small number of clusters, selected empirically) to identify the dominant color. The cluster containing the largest number of pixels is assumed to represent the background, and its centroid is used for filling. This ensures newly exposed areas match the overall appearance of the scene.
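A possible implementation of this background-color estimation, using OpenCV’s k-means, is sketched below; the cluster count of 3 and the zero-pixel test for invalid regions are assumed example choices.

```python
import cv2
import numpy as np

def dominant_background_color(image, k=3):
    """Estimate the dominant (background) color of an image via k-means clustering."""
    pixels = image.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    largest = np.argmax(np.bincount(labels.flatten()))  # cluster with the most pixels
    return centers[largest].astype(np.uint8)            # BGR centroid used as fill color

def fill_invalid(warped, fill_color):
    """Fill pixels left empty by the warp (assumed here to be exactly zero)."""
    mask = np.all(warped == 0, axis=-1)
    warped[mask] = fill_color
    return warped
```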
One can generate numerous new images from a single reference by iteratively sampling random perturbations, checking them against spatial and rotational bounds, and computing the resulting homographies. Each warped image corresponds to a distinct but constrained viewpoint near the original pose. Such synthetic datasets are highly valuable for algorithm development in fields like feature tracking, camera calibration, or visual odometry, where coverage of various positions and orientations is beneficial. Algorithm 2 summarizes the main steps explained earlier.
Algorithm 2 Synthetic dataset generation.
1. INPUT: reference image and pose.
2. Sample random perturbations in translation and rotation from $\mathcal{N}(0, \sigma^2)$.
3. Check whether the new pose lies within the predefined bounding limits (translation/rotation).
4. If out of bounds, discard and resample.
5. If valid, compute the homography and warp the reference image.
6. Fill invalid pixels using the dominant background color derived from k-means clustering.
7. OUTPUT: warped image and the associated perturbed pose.
Homography-based warping efficiently generates training data from a single reference image, significantly reducing the data collection effort. However, it relies on a planar approximation, which may fail under significant viewpoint changes or in scenes with complex 3D structures. Distortions can arise in these cases, leading to misleading feature associations. Likewise, dynamic objects or changing environments are not fully captured, widening the gap between synthetic and real conditions. While this approach streamlines data synthesis, these limitations can restrict the model’s robustness and generalizability in real-world settings. Nevertheless, diverse lighting conditions can be incorporated into the image generation process, enabling it to accommodate various visual servoing applications. Combining brightness, contrast, and color adjustments with realistic effects, such as shadows or noise, allows for a versatile approach to synthetic dataset development.
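By way of example, a simple photometric augmentation pass of this kind could look as follows; the gain, bias, and noise ranges are illustrative values, not the ones used to build our dataset.

```python
import numpy as np

def photometric_augment(image, rng=None):
    """Apply random brightness/contrast adjustment and additive Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    img = image.astype(np.float32)
    gain = rng.uniform(0.7, 1.3)              # contrast factor
    bias = rng.uniform(-25.0, 25.0)           # brightness shift in intensity levels
    noise = rng.normal(0.0, 4.0, img.shape)   # mild sensor-like noise
    out = gain * img + bias + noise
    return np.clip(out, 0, 255).astype(np.uint8)
```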
An example of using Algorithm 2 is illustrated next. Starting from the reference image depicted in Figure 5, two randomly generated images are shown in Figure 6. The image on the left was synthetically generated at the pose ( m, m, m, rad, rad, rad), and the image on the right at the pose ( m, m, m, rad, rad, rad).
As Figure 2 shows, the early fusion-based CNN architecture predicts velocities, so a conversion from poses to velocities is necessary to train the network. The velocities were computed by converting the difference between the current and desired robot poses into velocity commands. Specifically, given the initial and desired robot poses represented by Euler angles and positions, homogeneous transformation matrices are first constructed. The relative transformation between these two poses is computed, and the rotational component is extracted using the axis–angle representation. Finally, a proportional control law with gain $\lambda$ is applied to determine the linear and angular velocities. Algorithm 3 summarizes this process, illustrating how the velocity vector $\mathbf{v}_c = (v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)$ is derived directly from pose information, thus enabling efficient and precise visual servoing control.
Algorithm 3 Pose-to-velocity computation.
Require: current pose $(\mathbf{p}_c, \boldsymbol{\Theta}_c)$, desired pose $(\mathbf{p}_d, \boldsymbol{\Theta}_d)$, control gain $\lambda$
1: Compute the rotation matrices $R_c$ and $R_d$ from the Euler angles.
2: Construct the homogeneous transformation matrices
$$T_c = \begin{bmatrix} R_c & \mathbf{p}_c \\ \mathbf{0}^{\top} & 1 \end{bmatrix}, \qquad T_d = \begin{bmatrix} R_d & \mathbf{p}_d \\ \mathbf{0}^{\top} & 1 \end{bmatrix}.$$
3: Compute the relative transformation $T_{cd} = T_c^{-1} T_d$, with rotation $R_{cd}$ and translation $\mathbf{t}_{cd}$.
4: Extract the rotation angle $\theta$ and axis $\mathbf{u}$ from $R_{cd}$ using the axis–angle representation (the axis is defined only if $\theta \neq 0$; otherwise the angular velocity is zero).
5: Compute the linear and angular velocities $\mathbf{v} = \lambda\,\mathbf{t}_{cd}$ and $\boldsymbol{\omega} = \lambda\,\theta\,\mathbf{u}$.
Ensure: velocity vector $\mathbf{v}_c = (\mathbf{v}, \boldsymbol{\omega})$ with the components $(v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)$.
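A compact Python sketch of the steps in Algorithm 3, using scipy’s rotation utilities and an assumed gain value, is given below for reference; it mirrors the algorithm but is not the exact implementation from our experiments.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_velocity(p_c, euler_c, p_d, euler_d, gain=0.5, euler_order="xyz"):
    """Convert current/desired poses (position + Euler angles) into a 6-D velocity."""
    T_c = np.eye(4)
    T_c[:3, :3] = R.from_euler(euler_order, euler_c).as_matrix()
    T_c[:3, 3] = p_c
    T_d = np.eye(4)
    T_d[:3, :3] = R.from_euler(euler_order, euler_d).as_matrix()
    T_d[:3, 3] = p_d

    T_rel = np.linalg.inv(T_c) @ T_d                        # relative transform (current -> desired)
    axis_angle = R.from_matrix(T_rel[:3, :3]).as_rotvec()   # theta * u, zero vector if no rotation

    v = gain * T_rel[:3, 3]          # proportional linear velocity
    omega = gain * axis_angle        # proportional angular velocity
    return np.concatenate([v, omega])  # (vx, vy, vz, wx, wy, wz)
```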
The experimental results are detailed in the following section.
5. Conclusions
This paper presents a novel hybrid deep learning approach to visual servoing control systems. As a main contribution, the approach integrates raw RGB image data with feature point maps using an early fusion architecture. This innovative methodology offers several key advantages over traditional visual servoing techniques, paving the way for more robust and efficient robotic control in dynamic environments. The seamless integration of diverse data sources is a crucial element of this work, improving performance metrics across various aspects of the visual servoing pipeline.
A further contribution of this research is the creation of a comprehensive dataset combining real-world data, captured using a UR5 robotic arm in an eye-in-hand configuration, with a synthetically generated dataset. This dual approach addresses the limitations of relying solely on real or synthetic data. Real-world data capture the complexities and nuances of actual operating conditions, including lighting variations, occlusions, and unforeseen disturbances, but collecting sufficient real-world data can be time-consuming, expensive, and potentially hazardous. Synthetic data, on the other hand, allow vast amounts of labeled data to be generated under controlled conditions. Combining both sources provides a rich and diverse training ground for the deep learning models, leading to improved generalization and robustness.
Two distinct neural network architectures based on the robust ResNet framework were designed and trained using this combined dataset. One architecture directly incorporated additional feature maps into the input layer, while the other processed the RGB and feature map data separately before fusion. A comparative analysis of these architectures demonstrated the superior performance of the early fusion approach, which combines the raw RGB images and feature point maps at the earliest stage of the network’s processing. This early fusion strategy proved particularly effective in enhancing offline prediction accuracy and the convergence speed of the online servoing process. The results indicate the benefits of integrating diverse data sources to achieve superior performance in visual servoing tasks.
Looking ahead, some promising directions exist for extending this work. First, future scenarios will explore a more diverse range of pose configurations, varying significantly in position and orientation. Additionally, testing under varied environmental conditions, such as changing illumination, occlusions, and dynamic backgrounds, will further validate the algorithm’s robustness and generalization capabilities. Another interesting aspect of future work may involve refining the network to predict not just the velocity commands but also the 3D pose of the robot or camera in space. The current results aim to compare the early fusion strategy with a non-fusion baseline, demonstrating the practical benefit of incorporating supplementary feature maps for better scene generalization and robustness. Furthermore, although the existing literature relies on different datasets, we plan to compare our approach to other visual servoing methods in a future extension.
Future research directions include several extensions that could enhance current visual servoing capabilities. A particularly compelling application involves integrating our early fusion visual servoing approach with manipulation tasks, such as the work from [20]. Combining these two tasks could enable a comprehensive grasping system where the early fusion method guides robots to optimal grasping poses with improved precision. This integration would demonstrate practical applications in industrial assembly, warehouse automation, drone navigation and landing, and service robotics.
Additional research objectives include investigating alternative real-time representations, such as optical flow or depth-based maps, which may provide enhanced contextual information for the neural architecture. The network architecture could be extended to predict velocity commands and 3D pose estimation in a unified framework. While the current results validate the benefits of incorporating supplementary feature maps for improved scene generalization, future work will include comprehensive comparisons with state-of-the-art visual servoing methods using standardized benchmarks to establish the proposed approach’s performance within the broader research landscape.