1. Introduction
Visual servoing [1,2] is a control framework that integrates camera-based feedback to iteratively refine a robot’s trajectory in real time. This technique uses data from cameras and visual sensors to provide feedback on the robot’s environment and object interactions, enabling precise and adaptive control [3,4]. Numerous studies and advances in visual servoing have been reported in domains such as end-effector pose control of a manipulator [5,6], grasping [7,8,9,10,11], robot navigation [12,13], and medical applications [14,15,16]. The visual sensor may be mounted directly on the robot, referred to as the “eye-in-hand” setup, or placed separately in the working environment, known as the “eye-to-hand” setup. This paper focuses on the first setup.
Visual servoing comprises two main architectures: Image-Based Visual Servoing (IBVS) and Position-Based Visual Servoing (PBVS). The first, IBVS, operates directly in the image space, using visual features extracted from camera images to control robot motion [17]. The primary advantage of IBVS is its robustness to camera calibration errors and environmental changes, since it does not rely on precise 3D models. Instead, IBVS adjusts the robot’s motion based on real-time feedback from image features, making it suitable for dynamic and unpredictable environments. Its main challenges, however, are sensitivity to local minima and the difficulty of handling large displacements, which can lead to instability and inaccuracies. In contrast, PBVS uses the geometric relationship between the camera and the target object to compute the robot’s pose in 3D space [18]. This method relies on accurate camera calibration and a precise model of the environment to compute control commands. PBVS offers a global view of the task, enabling more predictable and stable control than IBVS. However, its dependence on accurate models and calibration makes it less robust to environmental changes and errors in model estimation.
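For reference, classical IBVS regulates an image-feature error with a simple proportional law; the formulation below follows the notation commonly used in the visual servoing literature (e.g., [17]) and is restated here only as background for the learned velocity prediction discussed later.

```latex
% Classical IBVS control law (standard formulation):
% s   : current visual features,  s^* : desired features,
% L_e : interaction (image Jacobian) matrix,  \lambda > 0 : control gain.
\mathbf{e} = \mathbf{s} - \mathbf{s}^{*}, \qquad
\mathbf{v}_c = -\lambda \, \widehat{\mathbf{L}}_{e}^{+} \, \mathbf{e}
```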
Recent advancements in deep learning have demonstrated significant potential for improving visual feedback control [8,9,10,11,19]. Unlike conventional approaches that depend on hand-designed features, deep learning techniques allow systems to autonomously learn and derive features from raw visual data. Convolutional neural networks (CNNs) have proven particularly effective when working with images because they can learn the features pertinent to a given problem without predefined feature extraction techniques. For example, ref. [19] presents a training procedure for a CNN that approximates the unknown inverse dynamics of the robot’s inner loop, allowing feedforward compensation to correct tracking error. The network is trained on data gathered from iterative learning control and then refined by transfer learning with real robot data to handle model discrepancies and improve performance. However, existing deep learning-based visual servoing methods often rely on separate processing of synthetic and real data, or they fail to incorporate complementary visual cues (e.g., feature point maps) at the earliest feature extraction layers. As a result, many models struggle to produce robust velocity estimates, particularly when real-world data are limited or scenes contain substantial texture variation.
A recent study [20] demonstrates a 3D vision-based pipeline that integrates YOLOv4 and ArUco markers to enable a collaborative robot arm to classify and manipulate various objects in a dynamic workspace. However, while this framework addresses object identification and grasping, it does not explore how low-level velocity control can be learned or optimized in real time, particularly via a deep learning approach that fuses additional geometric cues at the earliest stages of convolution. Our work addresses this gap by introducing a novel early fusion technique that concatenates feature maps with raw RGB data at the input level, allowing the network to exploit geometric cues from the outset.
To facilitate training, a variety of neural network models have been built on CNNs pretrained on classification tasks, including architectures such as AlexNet [21], VGG-16 [22], and FlowNet [23]. According to [8], Saxena et al. employed FlowNet for visual servoing tasks in diverse settings, avoiding the need for prior information on camera parameters or scene geometry. Their approach predicts the camera pose from concatenated images representing the current and the desired scenes. Bateux et al. [9] presented neural architectures derived from AlexNet and VGG-16, designed to predict the camera transformation between two images. These architectures target precise, reliable, real-time six-degrees-of-freedom (DOF) positioning with visual feedback, and they were trained on a synthetic dataset to enhance learning efficacy and improve robustness against changes in lighting conditions. Ref. [10] introduced DEFINet, a Siamese neural architecture that extracts features with two CNNs sharing parameters; these features are then fed to a regression module that estimates the relative pose between the current and target images acquired by an eye-to-hand camera. Ribeiro et al. [11] compared three CNN-based architectures for grasp detection, where the neural network provides a 3D pose from two input images depicting the initial and final scene layouts. The first architecture uses a single branch, concatenating the images along the depth channel to form the input array, with one regression block producing all six outputs. The second model uses the same input format but features dual output branches for position and orientation. The third CNN employs a separate feature extractor per input image, concatenates the resulting features, and uses a single regression block. Experimental results showed that the first, single-branch model achieved the best performance. A different approach is proposed in [24], where the authors introduce a Mean Similarity Image Measurement Loss function that combines an image similarity term, characterized by brightness, contrast, and structural differences, with the L1 loss. Their method uses a CNN based on the ResNet-152 architecture to predict 6-DOF pose information from monocular images, with training data generated by a spherical projection data generator that ensures uniform data distribution and efficient collection. Experimental results demonstrate that the method achieves higher pose prediction accuracy and robustness against occlusion than traditional methods. Other recent works highlight advanced strategies for 6-DOF robotic control and deep network-based detection tasks: ref. [25] presents a revised virtual decomposition plus PD control scheme for robust 6-DOF task-space manipulation, while ref. [26] illustrates how deep convolutional neural networks can effectively tackle complex, multiclass image recognition challenges.
This work aims to design and evaluate a hybrid deep learning framework for visual servoing tasks. The proposed approach incorporates additional valuable information into the input layers and utilizes transfer learning with a pretrained CNN for image classification. The chosen architecture employs an early fusion approach, enabling the integration of supplementary data derived from traditional visual servoing methods. In particular, the supplementary data that extend the network’s input arrays are a set of point features. Moreover, we compare a baseline architecture (without early fusion) with our early fusion design to demonstrate that incorporating these feature maps enhances scene generalization and overall robustness in visual servoing.
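To illustrate what such an early fusion input could look like in practice, the sketch below rasterizes detected keypoints into a single-channel feature-point map and concatenates it with the RGB data along the channel axis. The helper names, the use of OpenCV’s SURF detector (available in opencv-contrib), and the exact channel layout are illustrative assumptions, not the precise implementation used in this work.

```python
import cv2
import numpy as np

def feature_point_map(gray, hessian_threshold=400):
    """Rasterize detected keypoints into a single-channel map (hypothetical helper)."""
    # SURF lives in opencv-contrib; another detector (e.g., ORB) could be swapped in.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    keypoints = surf.detect(gray, None)
    fmap = np.zeros_like(gray, dtype=np.float32)
    for kp in keypoints:
        x = min(int(round(kp.pt[0])), fmap.shape[1] - 1)
        y = min(int(round(kp.pt[1])), fmap.shape[0] - 1)
        fmap[y, x] = 1.0                       # mark keypoint locations
    return cv2.GaussianBlur(fmap, (5, 5), 0)   # soften into a spatial density map

def early_fusion_input(rgb_current, rgb_desired):
    """Stack current/desired RGB images and their feature-point maps channel-wise."""
    maps = [feature_point_map(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
            for img in (rgb_current, rgb_desired)]
    channels = [rgb_current.astype(np.float32) / 255.0,
                rgb_desired.astype(np.float32) / 255.0,
                np.stack(maps, axis=-1)]
    return np.concatenate(channels, axis=-1)   # H x W x 8 early-fused input tensor
```

When reusing an ImageNet-pretrained backbone, one common option is to widen only the first convolutional layer so that it accepts the extra channels, initializing the additional kernel weights to zero or small random values; this keeps the transfer learning strategy described above intact.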
While our current framework uses only visual data for velocity prediction, integrating multisensor data (e.g., depth, IMU, or force/torque) could significantly enhance real-time performance and robustness. For example, ref. [27] reports that combining visual cues with force feedback leads to sub-millimeter precision in contact-rich assembly tasks, effectively handling scenarios where pure vision might fail due to occlusions or rapid visual changes. Similarly, ref. [28] demonstrates how fusing onboard cameras with inertial measurements enables a quadrotor to outperform expert pilots in drone racing, underscoring the value of richer sensor inputs in stabilizing high-speed maneuvers. Nonetheless, incorporating multiple sensors raises challenges, including precise calibration, sensor-specific noise modeling, and maintaining low-latency data processing on resource-constrained platforms.
This research introduces a visual servoing (VS) system that processes visual feedback and was fully trained and tested on a real dataset. The system operates without requiring any 3D model or camera parameter information. Given a desired image, the network predicts the most appropriate velocity commands for the camera. The main contributions of this work are as follows:
Design of a VS framework using an early fusion CNN-based method that employs feature points to access low-level pixel information, so that all network inputs carry a comprehensive description of the initial and final scenes.
Construction of a large dataset for end-to-end VS control using a Universal Robots manipulator (UR5) and deployment of the network on a real robot, as an extension of [29].
Implementation of the Deep Learning-based Visual Servoing control law in a real system using a UR5 robot.
3. Dataset Configuration
For the real system to perform the visual servoing task effectively, it is essential to construct a dataset that accurately reflects the environment in which the robot will operate. This dataset must address the task’s requirements and exhibit enough variability to ensure strong generalization. The data collection scenario employs a UR5 collaborative robot fitted with an “eye-in-hand” setup based on a ZED Mini stereo camera. The camera acquires images at 1080p resolution with a frame rate of 25 FPS. The setup also includes an NVIDIA Xavier AGX with a 512-core Volta GPU and an 8-core ARM 64-bit CPU, which ensures that all components of our visual servoing pipeline, including SURF feature point computation and the CNN-based prediction, run in real time at a stable 25 FPS.
Figure 3 presents the real-time application setup used for this work.
Figure 4 shows how the robotic system’s physical and computational layers are unified through ROS (Robot Operating System). The physical layer comprises the ZED Mini camera and the UR5 robot shown in Figure 3, linked by a motion dependency: any robot movement directly affects the captured visual data. The camera acquires images and sends them over USB to the camera node in the computational layer, while the robot concurrently transmits its pose data to the robot node via Ethernet.
The application node is the central unit within the ROS-based computational layer, using GPU acceleration to handle image and pose data efficiently. From this information, it calculates velocity commands and relays them to the robot node, thereby directing the UR5 robot’s real-time movements. This integrated design ensures precise data acquisition and processing, forming the backbone for robust visual servoing and broader generalization.
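To make this data flow concrete, the following minimal rospy sketch shows how an application node of this kind could be wired up. The topic names, message types, and the predict_velocity callback are illustrative assumptions rather than the exact interfaces of our system.

```python
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge

class ApplicationNode:
    """Minimal sketch of the ROS application node (topic names are assumptions)."""
    def __init__(self, predict_velocity):
        self.bridge = CvBridge()
        self.predict_velocity = predict_velocity  # e.g., the CNN wrapped in a callable
        self.cmd_pub = rospy.Publisher("/robot/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/camera/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        v = self.predict_velocity(frame)  # 6-vector: (vx, vy, vz, wx, wy, wz)
        cmd = Twist()
        cmd.linear.x, cmd.linear.y, cmd.linear.z = v[:3]
        cmd.angular.x, cmd.angular.y, cmd.angular.z = v[3:]
        self.cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("application_node")
    node = ApplicationNode(predict_velocity=lambda img: [0.0] * 6)  # placeholder model
    rospy.spin()
```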
To create the dataset, a strategy inspired by [9] was adopted, in which an entire dataset is synthetically generated from a single reference RGB image and its corresponding pose. Figure 5 presents the reference pose used in this work, acquired with the camera at the pose ( m, m, m, rad, rad, rad).
One effective strategy for generating additional camera views around a known reference pose is introducing slight perturbations in translation and rotation. Specifically, the translation components $(\Delta t_x, \Delta t_y, \Delta t_z)$ and rotation components $(\Delta \theta_x, \Delta \theta_y, \Delta \theta_z)$ are drawn from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu = 0$ centers the distribution at the reference pose and $\sigma$ determines the spread of the sampled poses. This approach naturally clusters samples around the reference due to the Gaussian peak at zero offset and maintains a smooth distribution over the six dimensions of rigid motion.
To ensure realistic camera movement, the Gaussian parameters must be carefully set. In particular, translational offsets are sampled with a small standard deviation (in meters), while rotational offsets use separate standard deviations for the individual rotation axes. These values keep the new poses within small but meaningful deviations from the reference. Moreover, incorporating explicit bounds on translation and rotation guards against implausibly large shifts while promoting diversity in the synthesized dataset.
Although these offsets arise from a Gaussian, some poses may still be unrealistic for a particular application. One can impose bounding rules on each newly sampled pose to address this. After a perturbation is drawn, the camera’s 3D position and rotation angles are examined to ensure they remain within fixed limits on rotation (in radians) and translation (in meters). This validation step ensures that synthetic viewpoints stay within a feasible range, filtering out outliers with excessive displacement or rotation.
Once a sampled pose is obtained, it is checked against the admissible pose space. In this work, the pose space is defined by bounded shifts in the x and y directions, with a smaller range permitted in z; likewise, rotational changes are limited relative to the original orientation. Any pose outside these bounds is discarded and resampled. This mechanism maintains a balanced spread of valid viewpoints reachable by the UR5 robot, without straying into physically unrealistic regions.
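A minimal sketch of this perturb-and-reject sampling is given below. The specific standard deviations and bounds are placeholders (the exact values depend on the workspace), and the helper name sample_pose_offset is ours.

```python
import numpy as np

# Placeholder standard deviations and bounds (workspace-dependent example values).
SIGMA = np.array([0.05, 0.05, 0.02, 0.03, 0.03, 0.05])   # (tx, ty, tz, rx, ry, rz)
BOUNDS = np.array([0.15, 0.15, 0.05, 0.10, 0.10, 0.15])  # max |offset| per component

def sample_pose_offset(rng=None, max_tries=100):
    """Draw a zero-mean Gaussian pose perturbation and reject out-of-bounds samples."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        offset = rng.normal(loc=0.0, scale=SIGMA)  # 6-D offset around the reference pose
        if np.all(np.abs(offset) <= BOUNDS):
            return offset  # translation in metres, rotation in radians
    raise RuntimeError("No valid pose offset found within the bounds")
```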
When the camera undergoes a small change in pose while observing a planar region of the scene, the relationship between the original and new images can be described by a homography [35]. In computer vision, a homography is a projective mapping that transforms points from one image of a planar surface into their counterparts in another image, both captured under a pinhole camera model. This concept underpins key operations, such as image registration and rectification, that align images from differing viewpoints [36,37]. Formally, if $\mathbf{p}_1$ and $\mathbf{p}_2$ represent the homogeneous coordinates of corresponding points in the first and second images, respectively, they are related by
$$\mathbf{p}_2 \approx H\,\mathbf{p}_1,$$
where $\approx$ indicates equality up to a scale factor, consistent with the nature of homogeneous coordinates.
Consider a plane specified by $\mathbf{n}^{\top}\mathbf{X} = d$, where $\mathbf{n}$ is the plane normal and $d$ is its distance from the camera. The homography induced by a planar surface in 3D can be written as
$$H_{3\mathrm{D}} = R + \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d},$$
where $R$ and $\mathbf{t}$ denote the rotation and translation between the two viewpoints. Including the camera intrinsic matrix $K$ yields the final 2D projective transformation,
$$H = K\left(R + \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d}\right)K^{-1},$$
which fully describes how to warp the original image to approximate the new viewpoint.
The matrix $K$ encodes the camera’s internal parameters: the horizontal and vertical focal lengths $f_x$ and $f_y$ and the principal point $(c_x, c_y)$. Incorporating $K$ ensures consistent projection into the image plane for both the reference and transformed poses. Since variations in focal length or principal point placement can significantly alter how transformations appear in pixel coordinates, applying $K$ and $K^{-1}$ on either side of the homography is crucial for accurate image warping.
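The homography above maps directly to a few lines of NumPy/OpenCV; the sketch below is illustrative, with the intrinsics, plane normal, and distance given only as assumed example values.

```python
import cv2
import numpy as np

def planar_homography(K, R, t, n, d):
    """H = K (R + t n^T / d) K^{-1} for a plane n^T X = d seen by a pinhole camera."""
    H = K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)
    return H / H[2, 2]  # normalize the arbitrary scale factor

# Example values (assumed for illustration only).
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # small-rotation example
t = np.array([0.02, -0.01, 0.0])   # metres
n = np.array([0.0, 0.0, 1.0])      # plane normal facing the camera
d = 0.5                            # plane distance in metres

reference = cv2.imread("reference.png")
H = planar_homography(K, R, t, n, d)
warped = cv2.warpPerspective(reference, H, (reference.shape[1], reference.shape[0]))
```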
During the homography transformation, some regions in the warped output may lack corresponding pixels in the source image. To fill these missing points, a background color is estimated using a k-means clustering approach. The reference image is reshaped into a set of pixel-value vectors, and k-means is run (with a small number of clusters, selected empirically) to identify the dominant color. The cluster containing the largest number of pixels is assumed to represent the background, and its centroid is used for filling. This ensures newly exposed areas match the overall appearance of the scene.
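A possible implementation of this background-color estimation, using OpenCV’s k-means, is sketched below; the cluster count of 3 and the zero-pixel test for invalid regions are assumed example choices.

```python
import cv2
import numpy as np

def dominant_background_color(image, k=3):
    """Estimate the dominant (background) color of an image via k-means clustering."""
    pixels = image.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    largest = np.argmax(np.bincount(labels.flatten()))  # cluster with the most pixels
    return centers[largest].astype(np.uint8)            # BGR centroid used as fill color

def fill_invalid(warped, fill_color):
    """Fill pixels left empty by the warp (assumed here to be exactly zero)."""
    mask = np.all(warped == 0, axis=-1)
    warped[mask] = fill_color
    return warped
```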
One can generate numerous new images from a single reference by iteratively sampling random perturbations, checking them against spatial and rotational bounds, and computing the resulting homographies. Each warped image corresponds to a distinct but constrained viewpoint near the original pose. Such synthetic datasets are highly valuable for algorithm development in fields like feature tracking, camera calibration, or visual odometry, where coverage of various positions and orientations is beneficial. Algorithm 2 summarizes the main steps explained earlier.
Algorithm 2 Synthetic dataset generation.
1. INPUT: reference image and pose.
2. Sample random perturbations in translation and rotation from $\mathcal{N}(0, \sigma^2)$.
3. Check whether the new pose lies within the predefined bounding limits (translation/rotation).
4. If out of bounds, discard and resample.
5. If valid, compute the homography and warp the reference image.
6. Fill invalid pixels using the dominant background color derived from k-means clustering.
7. OUTPUT: warped image and the associated perturbed pose.
Homography-based warping efficiently generates training data from a single reference image, significantly reducing the data collection effort. However, it relies on a planar approximation, which may fail under significant viewpoint changes or in scenes with complex 3D structures. Distortions can arise in these cases, leading to misleading feature associations. Likewise, dynamic objects or changing environments are not fully captured, widening the gap between synthetic and real conditions. While this approach streamlines data synthesis, these limitations can restrict the model’s robustness and generalizability in real-world settings. Nevertheless, diverse lighting conditions can be incorporated into the image generation process, enabling it to accommodate various visual servoing applications. Combining brightness, contrast, and color adjustments with realistic effects, such as shadows or noise, allows for a versatile approach to synthetic dataset development.
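By way of example, a simple photometric augmentation pass of this kind could look as follows; the gain, bias, and noise ranges are illustrative values, not the ones used to build our dataset.

```python
import numpy as np

def photometric_augment(image, rng=None):
    """Apply random brightness/contrast adjustment and additive Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    img = image.astype(np.float32)
    gain = rng.uniform(0.7, 1.3)              # contrast factor
    bias = rng.uniform(-25.0, 25.0)           # brightness shift in intensity levels
    noise = rng.normal(0.0, 4.0, img.shape)   # mild sensor-like noise
    out = gain * img + bias + noise
    return np.clip(out, 0, 255).astype(np.uint8)
```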
An example of using Algorithm 2 is illustrated next. Starting from the reference image depicted in Figure 5, two randomly generated images are shown in Figure 6. The image on the left was synthetically generated at the pose ( m, m, m, rad, rad, rad), and the image on the right at the pose ( m, m, m, rad, rad, rad).
As Figure 2 shows, the early fusion-based CNN architecture predicts velocities, so a conversion from poses to velocities is necessary to train the network. The velocities were computed by converting the difference between the current and desired robot poses into velocity commands. Specifically, given the initial and desired robot poses represented by Euler angles and positions, homogeneous transformation matrices are first constructed. The relative transformation between these two poses is computed, and the rotational component is extracted using the axis–angle representation. Finally, a proportional control law with gain $\lambda$ is applied to determine the linear and angular velocities. Algorithm 3 summarizes this process, illustrating how the velocity vector $\mathbf{v}_c = (v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)$ is derived directly from pose information, thus enabling efficient and precise visual servoing control.
Algorithm 3 Pose-to-velocity computation.
Require: current pose $(\mathbf{p}_c, \boldsymbol{\Theta}_c)$, desired pose $(\mathbf{p}_d, \boldsymbol{\Theta}_d)$, control gain $\lambda$
1: Compute the rotation matrices $R_c$ and $R_d$ from the Euler angles.
2: Construct the homogeneous transformation matrices
$$T_c = \begin{bmatrix} R_c & \mathbf{p}_c \\ \mathbf{0}^{\top} & 1 \end{bmatrix}, \qquad T_d = \begin{bmatrix} R_d & \mathbf{p}_d \\ \mathbf{0}^{\top} & 1 \end{bmatrix}.$$
3: Compute the relative transformation $T_{cd} = T_c^{-1} T_d$, with rotation $R_{cd}$ and translation $\mathbf{t}_{cd}$.
4: Extract the rotation angle $\theta$ and axis $\mathbf{u}$ from $R_{cd}$ using the axis–angle representation (the axis is defined only if $\theta \neq 0$; otherwise the angular velocity is zero).
5: Compute the linear and angular velocities $\mathbf{v} = \lambda\,\mathbf{t}_{cd}$ and $\boldsymbol{\omega} = \lambda\,\theta\,\mathbf{u}$.
Ensure: velocity vector $\mathbf{v}_c = (\mathbf{v}, \boldsymbol{\omega})$ with the components $(v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)$.
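A compact Python sketch of the steps in Algorithm 3, using scipy’s rotation utilities and an assumed gain value, is given below for reference; it mirrors the algorithm but is not the exact implementation from our experiments.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_velocity(p_c, euler_c, p_d, euler_d, gain=0.5, euler_order="xyz"):
    """Convert current/desired poses (position + Euler angles) into a 6-D velocity."""
    T_c = np.eye(4)
    T_c[:3, :3] = R.from_euler(euler_order, euler_c).as_matrix()
    T_c[:3, 3] = p_c
    T_d = np.eye(4)
    T_d[:3, :3] = R.from_euler(euler_order, euler_d).as_matrix()
    T_d[:3, 3] = p_d

    T_rel = np.linalg.inv(T_c) @ T_d                        # relative transform (current -> desired)
    axis_angle = R.from_matrix(T_rel[:3, :3]).as_rotvec()   # theta * u, zero vector if no rotation

    v = gain * T_rel[:3, 3]          # proportional linear velocity
    omega = gain * axis_angle        # proportional angular velocity
    return np.concatenate([v, omega])  # (vx, vy, vz, wx, wy, wz)
```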
The experimental results are detailed in the following section.
5. Conclusions
This paper presents a novel hybrid deep learning approach to visual servoing control systems. As a main contribution, the approach integrates raw RGB image data with feature point maps using an early fusion architecture. This innovative methodology offers several key advantages over traditional visual servoing techniques, paving the way for more robust and efficient robotic control in dynamic environments. The seamless integration of diverse data sources is a crucial element of this work, improving performance metrics across various aspects of the visual servoing pipeline.
A further contribution of this research is the creation of a comprehensive dataset combining real-world data, captured using a UR5 robotic arm in an eye-in-hand configuration, with a synthetically generated dataset. This dual approach addresses the limitations of relying solely on real or synthetic data. Real-world data capture the complexities and nuances of actual operating conditions, including lighting variations, occlusions, and unforeseen disturbances, but collecting sufficient real-world data can be time-consuming, expensive, and potentially hazardous. Synthetic data, on the other hand, allow vast amounts of labeled data to be generated under controlled conditions. Combining both sources provides a rich and diverse training ground for the deep learning models, leading to improved generalization and robustness.
Two distinct neural network architectures based on the robust ResNet framework were designed and trained using this combined dataset. One architecture directly incorporated additional feature maps into the input layer, while the other processed the RGB and feature map data separately before fusion. A comparative analysis of these architectures demonstrated the superior performance of the early fusion approach, which combines the raw RGB images and feature point maps at the earliest stage of the network’s processing. This early fusion strategy proved particularly effective in enhancing offline prediction accuracy and the convergence speed of the online servoing process. The results indicate the benefits of integrating diverse data sources to achieve superior performance in visual servoing tasks.
Looking ahead, some promising directions exist for extending this work. First, future scenarios will explore a more diverse range of pose configurations, varying significantly in position and orientation. Additionally, testing under varied environmental conditions, such as changing illumination, occlusions, and dynamic backgrounds, will further validate the algorithm’s robustness and generalization capabilities. Another interesting aspect of future work may involve refining the network to predict not just the velocity commands but also the 3D pose of the robot or camera in space. The current results aim to compare the early fusion strategy with a non-fusion baseline, demonstrating the practical benefit of incorporating supplementary feature maps for better scene generalization and robustness. Furthermore, although the existing literature relies on different datasets, we plan to compare our approach to other visual servoing methods in a future extension.
Future research directions include several extensions that could enhance current visual servoing capabilities. A particularly compelling application involves integrating our early fusion visual servoing approach with manipulation tasks, such as the work from [20]. Combining these two tasks could enable a comprehensive grasping system where the early fusion method guides robots to optimal grasping poses with improved precision. This integration would demonstrate practical applications in industrial assembly, warehouse automation, drone navigation and landing, and service robotics.
Additional research objectives include investigating alternative real-time representations, such as optical flow or depth-based maps, which may provide enhanced contextual information for the neural architecture. The network architecture could be extended to predict velocity commands and 3D pose estimation in a unified framework. While the current results validate the benefits of incorporating supplementary feature maps for improved scene generalization, future work will include comprehensive comparisons with state-of-the-art visual servoing methods using standardized benchmarks to establish the proposed approach’s performance within the broader research landscape.